Abstract

We study losses for binary classification and class probability estimation and 
extend the understanding of them from margin losses to general composite losses 
which are the composition of a proper loss with a link function. We characterise 
when margin losses can be proper composite losses, explicitly show how to determine
a symmetric loss in full from half of one of its partial losses, introduce an intrinsic 
parametrisation of composite binary losses and give a complete characterisation of 
the relationship between proper losses and "classification calibrated" losses. We 
also consider the question of the "best" surrogate binary loss. We introduce a 
precise notion of "best" and show there exist situations where two convex surrogate 
losses are incommensurable. We provide a complete explicit characterisation of the 
convexity of composite binary losses in terms of the link function and the weight 
function associated with the proper loss which make up the composite loss. This 
characterisation suggests new ways of "surrogate tuning". Finally, in an appendix
we present some new algorithm-independent results on the relationship between 
properness, convexity and robustness to misclassification noise for binary losses and 
show that all convex proper losses are non-robust to misclassification noise. 

1 Introduction 

A loss function is the means by which a learning algorithm's performance is judged. A 
binary loss function is a loss for a supervised prediction problem where there are two 
possible labels associated with the examples. A composite loss is the composition of a 
proper loss (defined below) and a link function (also defined below). In this paper we 
study composite binary losses and develop a number of new characterisation results. 

Informally, proper losses are well-calibrated losses for class probability estimation,
that is, for the problem of not only predicting a binary classification label but also providing
an estimate of the probability that an example will have a positive label. Link functions
are often used to map the outputs of a predictor to the interval [0, 1] so that they can be 
interpreted as probabilities. Having such probabilities is often important in applications, 
and there has been considerable interest in understanding how to get accurate probability 
estimates [36] [IHl [10] and understanding the implications of requiring loss functions 
provide good probability estimates [6].

Much previous work in the machine learning literature has focussed on margin losses 
which intrinsically treat positive and negative classes symmetrically. However it is now 
well understood how important it is to be able to deal with the non-symmetric case 
[21 [121 El IHl EZ]. A key goal of the present work is to consider composite losses in the
general (non-symmetric) situation. 

Having the flexibility to choose a loss function is important in order to "tailor" the 
solution to a machine learning problem; confer [181 IHl IH]. Understanding the structure
of the set of loss functions and having natural parametrisations of them is useful for this 
purpose. Even when one is using a loss as a surrogate for the loss one would ideally like 
to minimise, it is helpful to have an easy to use parametrisation — see the discussion of 
"surrogate tuning" in the Conclusion. 

The paper is structured as follows. In §2 we introduce the notions of a loss and of the
conditional and full risk, which we will make extensive use of throughout the paper.

In §3 we introduce losses for Class Probability Estimation (CPE), define some technical
properties of them, and present some structural results. We introduce and exploit
Savage's characterisation of proper losses and use it to characterise proper symmetric
CPE-losses.

In §4 we define composite losses formally and characterise when a loss is a proper
composite loss in terms of its partial losses. We introduce a natural and intrinsic
parametrisation of proper composite losses and characterise when a margin loss can be a proper
composite loss. We also show the relationship between regret and Bregman divergences
for general composite losses.

In §5 we characterise the relationship between classification calibrated losses (as
studied for example by Bartlett et al. [7]) and proper composite losses.

In §6, motivated by the question of which is the best surrogate loss, we characterise
when a proper composite loss is convex in terms of the natural parametrisation of such
losses.

In §7 we study surrogate losses, making use of some of the earlier material in the
paper. A surrogate loss function is a loss function which is not exactly what one wishes
to minimise but is easier to work with algorithmically. We define a well founded notion of
"best" surrogate loss and show that some convex surrogate losses are incommensurable
on some problems. We also study other notions of "best" and explicitly determine the
surrogate loss that has the best surrogate regret bound in a certain sense.

Finally, in §8 we draw some more general conclusions.

Appendix C builds upon some of the results in the main paper and presents some
new algorithm-independent results on the relationship between properness, convexity
and robustness to misclassification noise for binary losses and shows that all convex
proper losses are non-robust to misclassification noise.

2 Losses and Risks 

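We write $x \wedge y := \min(x, y)$ and $\llbracket p \rrbracket = 1$ if $p$ is true and $\llbracket p \rrbracket = 0$ otherwise.^1 The
generalised function $\delta(\cdot)$ is defined by $\int_a^b \delta(x) f(x)\,dx = f(0)$ when $f$ is continuous at $0$
and $a < 0 < b$. Random variables are written in sans-serif font: $\mathsf{X}$, $\mathsf{Y}$.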

Given a set of labels $\mathcal{Y} := \{-1, 1\}$ and a set of prediction values $\mathcal{V}$ we will say a
loss is any function^2 $\ell : \mathcal{Y} \times \mathcal{V} \to [0, \infty)$. We interpret such a loss as giving a penalty
$\ell(y, v)$ when predicting the value $v$ when an observed label is $y$. We can always write an
arbitrary loss in terms of its partial losses $\ell_1 := \ell(1, \cdot)$ and $\ell_{-1} := \ell(-1, \cdot)$ using

$$\ell(y, v) = \llbracket y = 1 \rrbracket\, \ell_1(v) + \llbracket y = -1 \rrbracket\, \ell_{-1}(v). \tag{1}$$

Our definition of a loss function covers all commonly used margin losses (i.e. those
which can be expressed as $\ell(y, v) = \phi(yv)$ for some function $\phi : \mathbb{R} \to [0, \infty)$) such as
the 0-1 loss $\ell(y, v) = \llbracket yv < 0 \rrbracket$, the hinge loss $\ell(y, v) = \max(1 - yv, 0)$, the logistic
loss $\ell(y, v) = \log(1 + e^{-yv})$, and the exponential loss $\ell(y, v) = e^{-yv}$ commonly used in
boosting. It also covers class probability estimation losses where the predicted values
$\hat\eta \in \mathcal{V} = [0, 1]$ are directly interpreted as probability estimates.^3 We will use $\hat\eta$ instead of
$v$ as an argument to indicate losses for class probability estimation and use the shorthand
CPE losses to distinguish them from general losses. For example, square loss has partial
losses $\ell_{-1}(\hat\eta) = \hat\eta^2$ and $\ell_1(\hat\eta) = (1 - \hat\eta)^2$, the log loss has $\ell_{-1}(\hat\eta) = -\log(1 - \hat\eta)$ and $\ell_1(\hat\eta) = -\log(\hat\eta)$,
and the family of cost-weighted misclassification losses parametrised by $c \in (0, 1)$ is given
by

$$\ell_c(-1, \hat\eta) = c\,\llbracket \hat\eta \ge c \rrbracket \quad\text{and}\quad \ell_c(1, \hat\eta) = (1 - c)\,\llbracket \hat\eta < c \rrbracket. \tag{2}$$

2.1 Conditional and Full Risks 

Suppose we have random examples $\mathsf{X}$ with associated labels $\mathsf{Y} \in \{-1, 1\}$. The joint
distribution of $(\mathsf{X}, \mathsf{Y})$ is denoted $P$ and the marginal distribution of $\mathsf{X}$ is denoted $M$. Let
the observation conditional density be $\eta(x) := \Pr(\mathsf{Y} = 1 \mid \mathsf{X} = x)$. Thus one can specify an
experiment by either $P$ or $(\eta, M)$.

If $\eta \in [0, 1]$ is the probability of observing the label $y = 1$ the point-wise risk (or
conditional risk) of the estimate $v \in \mathcal{V}$ is defined as the $\eta$-average of the loss
for $v$:

$$L(\eta, v) := \mathbb{E}_{\mathsf{Y} \sim \eta}[\ell(\mathsf{Y}, v)] = \eta\,\ell_1(v) + (1 - \eta)\,\ell_{-1}(v).$$

Here, $\mathsf{Y} \sim \eta$ is a shorthand for labels being drawn from a Bernoulli distribution with
parameter $\eta$. When $\eta : \mathcal{X} \to [0, 1]$ is an observation-conditional density, taking the
$M$-average of the point-wise risk gives the (full) risk of the estimator $v$, now interpreted as

^1 This is the Iverson bracket notation as recommended by Knuth [27].

^2 Restricting the output of a loss to $[0, \infty)$ is equivalent to assuming the loss has a lower bound and
then translating its output.

^3 These are known as scoring rules in the statistical literature [16].
a function $v : \mathcal{X} \to \mathcal{V}$:

$$\mathbb{L}(\eta, v, M) := \mathbb{E}_{\mathsf{X} \sim M}[L(\eta(\mathsf{X}), v(\mathsf{X}))].$$

We sometimes write $\mathbb{L}(v, P)$ for $\mathbb{L}(\eta, v, M)$ where $(\eta, M)$ corresponds to the joint
distribution $P$. We write $\ell$, $L$ and $\mathbb{L}$ for the loss, point-wise and full risk throughout this
paper. The Bayes risk is the minimal achievable value of the risk and is denoted

$$\underline{\mathbb{L}}(\eta, M) := \inf_{v \in \mathcal{V}^{\mathcal{X}}} \mathbb{L}(\eta, v, M) = \mathbb{E}_{\mathsf{X} \sim M}[\underline{L}(\eta(\mathsf{X}))],$$

where

$$[0, 1] \ni \eta \mapsto \underline{L}(\eta) := \inf_{v \in \mathcal{V}} L(\eta, v)$$

is the point-wise or conditional Bayes risk.

There has been increasing awareness of the importance of the conditional Bayes risk
curve $\underline{L}(\eta)$, also known as "generalized entropy" [17], in the analysis of losses for
probability estimation [23] [M| [T| [32]. Below we will see how it is effectively the curvature
of $\underline{L}$ that determines much of the structure of these losses.
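The definitions of the conditional risk and the conditional Bayes risk can be illustrated numerically. The following is a minimal sketch for square loss, assuming the partial losses given in the text; the function names and the grid-search minimisation are illustrative choices, not part of the paper.

```python
def square_partials():
    """Return (l_minus, l_plus), the partial losses of square loss on [0, 1]."""
    return (lambda p: p ** 2, lambda p: (1.0 - p) ** 2)


def conditional_risk(eta, p, l_minus, l_plus):
    """L(eta, p) = eta * l_1(p) + (1 - eta) * l_{-1}(p)."""
    return eta * l_plus(p) + (1.0 - eta) * l_minus(p)


def bayes_risk(eta, l_minus, l_plus, grid_size=10001):
    """Approximate the conditional Bayes risk by minimising over a grid of predictions."""
    grid = (i / (grid_size - 1) for i in range(grid_size))
    return min(conditional_risk(eta, p, l_minus, l_plus) for p in grid)


def full_risk(etas, predictions, l_minus, l_plus):
    """Empirical M-average of the point-wise risk over a finite sample of X."""
    risks = [conditional_risk(e, p, l_minus, l_plus) for e, p in zip(etas, predictions)]
    return sum(risks) / len(risks)


l_minus, l_plus = square_partials()
for eta in (0.1, 0.5, 0.8):
    # For square loss the conditional Bayes risk has the closed form eta * (1 - eta).
    print(eta, bayes_risk(eta, l_minus, l_plus), eta * (1.0 - eta))
```

For square loss the minimiser is the prediction equal to $\eta$, so the grid minimum should agree with $\eta(1 - \eta)$ up to the grid resolution.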

3 Losses for Class Probability Estimation 

We begin by considering CPE losses, that is, functions $\ell : \{-1, 1\} \times [0, 1] \to [0, \infty)$, and
briefly summarise a number of important existing structural results for proper losses,
a large, natural class of losses for class probability estimation.

3.1 Proper, Fair, Definite and Regular Losses 

There are a few properties of losses for probability estimation that we will require. If
$\hat\eta$ is to be interpreted as an estimate of the true positive class probability $\eta$ (i.e., when
$y = 1$) then it is desirable to require that $L(\eta, \hat\eta)$ be minimised by $\hat\eta = \eta$ for all $\eta \in [0, 1]$.
Losses that satisfy this constraint are said to be Fisher consistent and are known as
proper losses [HI [16]. That is, a proper loss $\ell$ satisfies $\underline{L}(\eta) = L(\eta, \eta)$ for all $\eta \in [0, 1]$. A
strictly proper loss is a proper loss for which the minimiser of $L(\eta, \hat\eta)$ over $\hat\eta$ is unique.
We will say a loss is fair whenever

$$\ell_{-1}(0) = \ell_1(1) = 0. \tag{3}$$

That is, there is no loss incurred for perfect prediction. The main place fairness is relied
upon is in the integral representation of Theorem 6 where it is used to get rid of some
constants of integration. In order to explicitly construct losses from their associated
"weight functions" as shown in Theorem 7 we will require that the loss be definite,
that is, its point-wise Bayes risk for deterministic events (i.e., $\eta = 0$ or $\eta = 1$) must be
bounded from below:

$$\underline{L}(0) > -\infty, \qquad \underline{L}(1) > -\infty. \tag{4}$$

Since properness of a loss ensures $\underline{L}(\eta) = L(\eta, \eta)$ we see that a fair proper loss is
necessarily definite since $L(0, 0) = \ell_{-1}(0) = 0 > -\infty$, and similarly for $L(1, 1)$. Conversely, if
a proper loss is definite then the finite values $\ell_{-1}(0)$ and $\ell_1(1)$ can be subtracted from
$\ell(-1, \cdot)$ and $\ell(1, \cdot)$ to make it fair.

Finally, for Theorem 4 to hold at the endpoints of the unit interval, we require a loss
to be regular^4; that is,

$$\lim_{\eta \searrow 0} \eta\,\ell_1(\eta) = \lim_{\eta \nearrow 1} (1 - \eta)\,\ell_{-1}(\eta) = 0. \tag{5}$$

Intuitively, this condition ensures that making mistakes on events that never happen
should not incur a penalty. Most of the situations we consider in the remainder of
this paper will involve losses which are proper, fair, definite and regular.
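Properness can be probed numerically by checking that the conditional risk is minimised at the true probability. The sketch below does this on a grid for log loss and for a linear loss with partial losses $\ell_{-1}(\hat\eta) = \hat\eta$ and $\ell_1(\hat\eta) = 1 - \hat\eta$; the linear loss is an illustrative counterexample chosen here, and a grid check of this kind is only evidence, not a proof.

```python
import math


def conditional_risk(eta, p, l_minus, l_plus):
    """L(eta, p) = eta * l_1(p) + (1 - eta) * l_{-1}(p)."""
    return eta * l_plus(p) + (1.0 - eta) * l_minus(p)


def risk_minimiser(eta, l_minus, l_plus, grid_size=999):
    """Grid point in (0, 1) that minimises the conditional risk for a given eta."""
    grid = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    return min(grid, key=lambda p: conditional_risk(eta, p, l_minus, l_plus))


def looks_proper(l_minus, l_plus, etas=(0.1, 0.3, 0.5, 0.7, 0.9), tol=1e-3):
    """True if the grid minimiser coincides with eta for every tested eta."""
    return all(abs(risk_minimiser(eta, l_minus, l_plus) - eta) <= tol for eta in etas)


# Log loss: l_{-1}(p) = -log(1 - p), l_1(p) = -log(p).
log_loss = (lambda p: -math.log(1.0 - p), lambda p: -math.log(p))
# Linear loss (hypothetical counterexample): its risk is linear in p, so the
# minimiser sits at an endpoint rather than at eta.
linear_loss = (lambda p: p, lambda p: 1.0 - p)

print(looks_proper(*log_loss), looks_proper(*linear_loss))
```

Both losses are fair in the sense of (3), which shows that fairness alone does not imply properness.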



3.2 The Structure of Proper Losses 

A key result in the study of proper losses is originally due to Shuford et al. [45], though our
presentation follows that of Buja et al. [9]. It characterises proper losses for probability
estimation via a constraint on the relationship between its partial losses.

Theorem 1 Suppose $\ell : \{-1, 1\} \times [0, 1] \to \mathbb{R}$ is a loss and that its partial losses $\ell_1$ and
$\ell_{-1}$ are both differentiable. Then $\ell$ is a proper loss if and only if for all $\hat\eta \in (0, 1)$

$$\frac{-\ell_1'(\hat\eta)}{1 - \hat\eta} = \frac{\ell_{-1}'(\hat\eta)}{\hat\eta} = w(\hat\eta) \tag{6}$$

for some weight function $w : (0, 1) \to \mathbb{R}^+$ such that $\int_\epsilon^{1-\epsilon} w(c)\,dc < \infty$ for all $\epsilon > 0$.

The equalities in (6) should be interpreted in the $L_1$ sense.

This simple characterisation of the structure of proper losses has a number of interesting
implications. Observe from (6) that if $\ell$ is proper, given $\ell_1$ we can determine $\ell_{-1}$
or vice versa. Also, the partial derivative of the conditional risk can be seen to be the
product of a linear term and the weight function:

Corollary 2 If $\ell$ is a differentiable proper loss then for all $\eta \in [0, 1]$

$$\frac{\partial}{\partial\hat\eta} L(\eta, \hat\eta) = (1 - \eta)\,\ell_{-1}'(\hat\eta) + \eta\,\ell_1'(\hat\eta) = (\hat\eta - \eta)\,w(\hat\eta). \tag{7}$$

Another corollary, observed by Buja et al. [9], is that the weight function is related to
the curvature of the conditional Bayes risk $\underline{L}$.

Corollary 3 Let $\ell$ be a twice differentiable^5 proper loss with weight function $w$ defined
as in equation (6). Then for all $c \in (0, 1)$ its conditional Bayes risk $\underline{L}$ satisfies

$$w(c) = -\underline{L}''(c). \tag{8}$$
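As a worked instance of the relation $w(c) = -\underline{L}''(c)$, consider log loss with partial losses $\ell_1(\hat\eta) = -\log\hat\eta$ and $\ell_{-1}(\hat\eta) = -\log(1 - \hat\eta)$. The derivation below is a sketch that assumes natural logarithms and uses properness of log loss to write $\underline{L}(\eta) = L(\eta, \eta)$.

```latex
\begin{align*}
\underline{L}(\eta) &= L(\eta, \eta) = -\eta\log\eta - (1 - \eta)\log(1 - \eta), \\
\underline{L}'(\eta) &= \log(1 - \eta) - \log\eta, \\
\underline{L}''(\eta) &= -\frac{1}{1 - \eta} - \frac{1}{\eta} = -\frac{1}{\eta(1 - \eta)}, \\
w(c) &= -\underline{L}''(c) = \frac{1}{c(1 - c)}.
\end{align*}
```

The same weight is recovered directly from the partial losses, since $\ell_{-1}'(\hat\eta)/\hat\eta = 1/(\hat\eta(1 - \hat\eta))$.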



^4 This is equivalent to the conditions of Savage [41] and Schervish [42].

^5 The restriction to differentiable losses can be removed in most cases if generalised weight functions
(that is, possibly infinite but defining a measure on $(0, 1)$) are permitted. For example, the weight
function for the 0-1 loss is $w(c) = \delta(c - \frac{1}{2})$.



One immediate consequence of this corollary is that the conditional Bayes risk for a
proper loss is always concave. Along with an extra constraint, this gives another
characterisation of proper losses [H] [38].

Theorem 4 (Savage) A loss function $\ell$ is proper if and only if its point-wise Bayes
risk $\underline{L}(\eta)$ is concave and for each $\eta, \hat\eta \in (0, 1)$

$$L(\eta, \hat\eta) = \underline{L}(\hat\eta) + (\eta - \hat\eta)\,\underline{L}'(\hat\eta). \tag{9}$$

Furthermore if $\ell$ is regular this characterisation also holds at the endpoints $\eta, \hat\eta \in \{0, 1\}$.

This link between loss and concave functions makes it easy to establish a connection,
as Buja et al. [9] do, between regret $\Delta L(\eta, \hat\eta) := L(\eta, \hat\eta) - \underline{L}(\eta)$ for proper losses and
Bregman divergences. The latter are generalisations of distances and are defined in terms
of convex functions. Specifically, if $f : \mathcal{S} \to \mathbb{R}$ is a convex function over some convex set
$\mathcal{S} \subseteq \mathbb{R}^n$ then its associated Bregman divergence^6 is

$$D_f(s, s_0) := f(s) - f(s_0) - \langle s - s_0, \nabla f(s_0) \rangle$$

for any $s, s_0 \in \mathcal{S}$, where $\nabla f(s_0)$ is the gradient of $f$ at $s_0$. By noting that over $\mathcal{S} = [0, 1]$ we
have $\nabla f = f'$, these definitions lead immediately to the following corollary of Theorem 4.

Corollary 5 If $\ell$ is a proper loss then its regret is the Bregman divergence associated
with $f = -\underline{L}$. That is,

$$\Delta L(\eta, \hat\eta) = D_{-\underline{L}}(\eta, \hat\eta). \tag{10}$$
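The identity between regret and Bregman divergence can be checked numerically. The sketch below does so for log loss, for which $-\underline{L}$ is the negative binary Shannon entropy and the divergence is the binary Kullback-Leibler divergence; the function names are illustrative.

```python
import math


def log_loss_risk(eta, p):
    """Conditional risk L(eta, p) of log loss."""
    return -eta * math.log(p) - (1.0 - eta) * math.log(1.0 - p)


def neg_bayes_risk(eta):
    """f = -L_bar for log loss: the (convex) negative Shannon entropy."""
    return eta * math.log(eta) + (1.0 - eta) * math.log(1.0 - eta)


def neg_bayes_risk_grad(eta):
    """Derivative of f on (0, 1)."""
    return math.log(eta) - math.log(1.0 - eta)


def bregman(f, grad_f, s, s0):
    """One-dimensional Bregman divergence D_f(s, s0)."""
    return f(s) - f(s0) - (s - s0) * grad_f(s0)


def regret(eta, p):
    """Regret L(eta, p) - L_bar(eta); log loss is proper so L_bar(eta) = L(eta, eta)."""
    return log_loss_risk(eta, p) - log_loss_risk(eta, eta)


eta, p = 0.3, 0.6
print(regret(eta, p), bregman(neg_bayes_risk, neg_bayes_risk_grad, eta, p))
```

The two printed values should agree up to floating-point rounding.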

Many of the above results can be observed graphically by plotting the conditional
risk for a proper loss as in Figure 1. Here we see the two partial losses on the left and
right sides of the figure are related, for each fixed $\hat\eta$, by the linear map $\eta \mapsto L(\eta, \hat\eta) =
(1 - \eta)\,\ell_{-1}(\hat\eta) + \eta\,\ell_1(\hat\eta)$. For each fixed $\eta$ the properness of $\ell$ requires that these convex
combinations of the partial losses (each slice parallel to the left and right faces) are
minimised when $\hat\eta = \eta$. Thus, the lines joining the partial losses are tangent to the
conditional Bayes risk curve $\eta \mapsto \underline{L}(\eta) = L(\eta, \eta)$ shown above the dotted diagonal. Since
the conditional Bayes risk curve is the lower envelope of these tangents it is necessarily
concave. The coupling of the partial losses via the tangents to the conditional Bayes
risk curve demonstrates why much of the structure of proper losses is determined by the
curvature of $\underline{L}$, that is, by the weight function $w$.

The relationship between a proper loss and its associated weight function is captured
succinctly via the following representation of proper losses as a weighted integral of the
cost-weighted misclassification losses $\ell_c$ defined in (2). The reader is referred to [39] for
the details, proof and the history of this result.

^6 A concise summary of Bregman divergences and their properties is given by Banerjee et al. [H
Appendix A].
Figure 1: The structure of the conditional risk $L(\eta, \hat\eta)$ for a proper loss (grey surface).
The loss is log loss and its partials $\ell_{-1}(\hat\eta) = -\log(1 - \hat\eta)$ and $\ell_1(\hat\eta) = -\log(\hat\eta)$ are shown on
the left and right faces of the box (blue curves). The conditional Bayes risk is the (green)
curve above the dotted line $\hat\eta = \eta$. The (red) line connecting points on the partial loss
curves shows the conditional risk for a fixed prediction $\hat\eta$.

Table 1: Weight functions and associated partial losses.



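Theorem 6 Let $\ell : \mathcal{Y} \times [0, 1] \to \mathbb{R}$ be a fair, proper loss. Then for each $\hat\eta \in (0, 1)$ and
$y \in \mathcal{Y}$

$$\ell(y, \hat\eta) = \int_0^1 \ell_c(y, \hat\eta)\, w(c)\, dc, \tag{11}$$

where $w = -\underline{L}''$. Conversely, if $\ell$ is defined by (11) for some weight function $w : (0, 1) \to
[0, \infty)$ then it is proper.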

Some example losses and their associated weight functions are given in Table 1. Buja
et al. [9] show that $\ell$ is strictly proper if and only if $w(c) > 0$ in the sense that $w$ has
non-zero mass on every open subset of $(0, 1)$. The following theorem from Reid and
Williamson [38] shows how to explicitly construct a loss in terms of a weight function.

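Theorem 7 Given a weight function $w : [0, 1] \to [0, \infty)$, let $W(t) = \int^t w(c)\,dc$ and
$\overline{W}(t) = \int^t W(c)\,dc$. Then the loss $\ell_w$ defined by

$$\ell_w(y, \hat\eta) = -\overline{W}(\hat\eta) - (y - \hat\eta)\,W(\hat\eta) \tag{12}$$

is a proper loss. Additionally, if $\overline{W}(0)$ and $\overline{W}(1)$ are both finite then

$$\ell_w(y, \hat\eta) + (\overline{W}(1) - \overline{W}(0))\,y + \overline{W}(0) \tag{13}$$

is a fair, proper loss.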



Observe that if $w$ and $v$ are functions which differ on a set of measure zero then they
will lead to the same loss. A simple corollary to Theorem 6 is that the partial losses are
given by

$$\ell_1(\hat\eta) = \int_{\hat\eta}^1 (1 - c)\,w(c)\,dc \quad\text{and}\quad \ell_{-1}(\hat\eta) = \int_0^{\hat\eta} c\,w(c)\,dc. \tag{14}$$
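The integral formulas for the partial losses can be evaluated numerically for any weight function. The sketch below assumes the representation $\ell_1(\hat\eta) = \int_{\hat\eta}^1 (1 - c)\,w(c)\,dc$ and $\ell_{-1}(\hat\eta) = \int_0^{\hat\eta} c\,w(c)\,dc$ and uses a simple midpoint rule (an illustrative choice) with the log loss weight $w(c) = 1/(c(1 - c))$.

```python
import math


def integrate(f, a, b, n=20000):
    """Composite midpoint rule; it never evaluates f at the endpoints a and b."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))


def partial_losses(w):
    """Build (l_minus, l_plus) from a weight function w on (0, 1)."""
    def l_plus(p):
        return integrate(lambda c: (1.0 - c) * w(c), p, 1.0)

    def l_minus(p):
        return integrate(lambda c: c * w(c), 0.0, p)

    return l_minus, l_plus


def log_weight(c):
    """Weight function associated with log loss."""
    return 1.0 / (c * (1.0 - c))


l_minus, l_plus = partial_losses(log_weight)
p = 0.3
print(l_plus(p), -math.log(p))
print(l_minus(p), -math.log(1.0 - p))
```

With this weight the integrals reduce to $\int_{\hat\eta}^1 dc/c$ and $\int_0^{\hat\eta} dc/(1 - c)$, so the numerical values should match the log loss partial losses.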



3.3 Symmetric Losses 

We will say a loss is symmetric if $\ell_1(\hat\eta) = \ell_{-1}(1 - \hat\eta)$ for all $\hat\eta \in [0, 1]$. We say a weight
function for a proper loss or the conditional Bayes risk is symmetric if $w(c) = w(1 - c)$
or $\underline{L}(c) = \underline{L}(1 - c)$ for all $c \in [0, 1]$. Perhaps unsurprisingly, an immediate consequence
of Theorem 1 is that these two notions are identical.

Corollary 8 A proper loss is symmetric if and only if its weight function is symmetric. 

Requiring a loss to be proper and symmetric constrains the partial losses significantly.
Properness alone completely specifies one partial loss from the other. Now suppose in
addition that $\ell$ is symmetric. Combining $\ell_1(\hat\eta) = \ell_{-1}(1 - \hat\eta)$ with (6) implies

$$\ell_{-1}'(1 - \hat\eta) = \frac{1 - \hat\eta}{\hat\eta}\, \ell_{-1}'(\hat\eta). \tag{15}$$

This shows that $\ell_{-1}$ is completely determined by $\ell_{-1}(\hat\eta)$ for $\hat\eta \in [0, \frac{1}{2}]$ (or $\hat\eta \in [\frac{1}{2}, 1]$).
Thus in order to specify a symmetric proper loss, one needs to only specify one of the
partial losses on one half of the interval $[0, 1]$. Assuming $\ell_{-1}$ is continuous at $\frac{1}{2}$ (or
equivalently that $w$ has no atoms at $\frac{1}{2}$), by integrating both sides of (15) we can derive
an explicit formula for the other half of $\ell_{-1}$ in terms of that which is specified:

$$\ell_{-1}(\hat\eta) = \ell_{-1}(\tfrac{1}{2}) + \int_{1/2}^{\hat\eta} \frac{x}{1 - x}\, \ell_{-1}'(1 - x)\, dx, \tag{16}$$

which works for determining $\ell_{-1}$ on either $[0, \frac{1}{2}]$ or $[\frac{1}{2}, 1]$ when $\ell_{-1}$ is specified on $[\frac{1}{2}, 1]$ or
$[0, \frac{1}{2}]$ respectively (recalling the usual convention that $\int_a^b = -\int_b^a$). We have thus shown:

Theorem 9 If a loss is proper and symmetric, then it is completely determined by
specifying one of the partial losses on half the unit interval (either $[0, \frac{1}{2}]$ or $[\frac{1}{2}, 1]$) and
using (15) and (16).

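We demonstrate (16) with four examples. Suppose that $\ell_{-1}(\hat\eta) = \frac{1}{1 - \hat\eta}$ for $\hat\eta \in [0, \frac{1}{2}]$.
Then one can readily determine the complete partial loss to be

$$\ell_{-1}(\hat\eta) = \frac{\llbracket \hat\eta < \frac{1}{2} \rrbracket}{1 - \hat\eta} + \llbracket \hat\eta \ge \tfrac{1}{2} \rrbracket \left(2 + \log\frac{\hat\eta}{1 - \hat\eta}\right). \tag{17}$$

Suppose instead that $\ell_{-1}(\hat\eta) = \frac{1}{1 - \hat\eta}$ for $\hat\eta \in [\frac{1}{2}, 1]$. In that case we obtain

$$\ell_{-1}(\hat\eta) = \llbracket \hat\eta < \tfrac{1}{2} \rrbracket \left(2 + \log\frac{\hat\eta}{1 - \hat\eta}\right) + \frac{\llbracket \hat\eta \ge \frac{1}{2} \rrbracket}{1 - \hat\eta}. \tag{18}$$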
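Suppose $\ell_{-1}(\hat\eta) = \frac{1}{(1 - \hat\eta)^2}$ for $\hat\eta \in [0, \frac{1}{2}]$. Then one can determine that

$$\ell_{-1}(\hat\eta) = \frac{\llbracket \hat\eta < \frac{1}{2} \rrbracket}{(1 - \hat\eta)^2} + \llbracket \hat\eta \ge \tfrac{1}{2} \rrbracket \left(4 + \frac{2\,(2\hat\eta + \hat\eta\log\hat\eta - \hat\eta\log(1 - \hat\eta) - 1)}{\hat\eta}\right).$$

Finally consider specifying that $\ell_{-1}(\hat\eta) = \hat\eta$ for $\hat\eta \in [0, \frac{1}{2}]$. In this case we obtain that

$$\ell_{-1}(\hat\eta) = \llbracket \hat\eta \le \tfrac{1}{2} \rrbracket\, \hat\eta + \llbracket \hat\eta > \tfrac{1}{2} \rrbracket \left(1 - \log 2 - \hat\eta - \log(1 - \hat\eta)\right).$$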
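The completion of a symmetric proper loss from one half of a partial loss can be reproduced numerically. The sketch below takes the last example, where $\ell_{-1}(\hat\eta) = \hat\eta$ on $[0, \frac{1}{2}]$, and evaluates the completion formula $\ell_{-1}(\hat\eta) = \ell_{-1}(\frac{1}{2}) + \int_{1/2}^{\hat\eta} \frac{x}{1 - x}\,\ell_{-1}'(1 - x)\,dx$ with a midpoint rule; the helper names and the quadrature are illustrative assumptions.

```python
import math


def integrate(f, a, b, n=20000):
    """Composite midpoint rule; the step is negative when b < a, which gives
    the usual sign convention for reversed limits."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))


def complete_upper_half(value_at_half, dl_lower, p):
    """Value of l_{-1}(p) for p in [1/2, 1), given l_{-1}(1/2) and the
    derivative dl_lower of l_{-1} on the specified half [0, 1/2]."""
    integrand = lambda x: x / (1.0 - x) * dl_lower(1.0 - x)
    return value_at_half + integrate(integrand, 0.5, p)


def closed_form(p):
    """Closed form stated in the text for p > 1/2."""
    return 1.0 - math.log(2.0) - p - math.log(1.0 - p)


# l_{-1}(p) = p on [0, 1/2], so l_{-1}(1/2) = 1/2 and the derivative is 1 there.
p = 0.8
numeric = complete_upper_half(0.5, lambda x: 1.0, p)
print(numeric, closed_form(p))
```

The two values should coincide up to quadrature error, and both reduce to $\frac{1}{2}$ at $\hat\eta = \frac{1}{2}$, consistent with continuity there.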

4 Composite Losses 

General loss functions are often constructed with the aid of a link function. For a
particular set of prediction values $\mathcal{V}$ this is any continuous mapping $\psi : [0, 1] \to \mathcal{V}$. In
this paper, our focus will be composite losses for binary class probability estimation.
These are the composition of a CPE loss $\ell : \{-1, 1\} \times [0, 1] \to \mathbb{R}$ and the inverse of a link
function $\psi$, an invertible mapping from the unit interval to some range of values. Unless
stated otherwise we will assume $\psi : [0, 1] \to \mathbb{R}$. We will denote a composite loss by

$$\ell^\psi(y, v) := \ell(y, \psi^{-1}(v)). \tag{19}$$



The classical motivation for link functions [35] is that often in estimating $\eta$ one uses
a parametric representation of $\hat\eta : \mathcal{X} \to [0, 1]$ which has a natural scale not matching
$[0, 1]$. Traditionally one writes $\hat\eta = \psi^{-1}(\hat h)$ where $\psi^{-1}$ is the "inverse link" (and $\psi$ is of
course the forward link). The function $\hat h : \mathcal{X} \to \mathbb{R}$ is the hypothesis. Often $\hat h = \hat h_\alpha$ is
parametrised linearly in a parameter vector $\alpha$. In such a situation it is computationally
convenient if $\ell(y, \psi^{-1}(\hat h))$ is convex in $\hat h$ (which implies it is convex in $\alpha$ when $\hat h_\alpha$ is
linear in $\alpha$).
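A composite loss is straightforward to build in code from a CPE loss and an inverse link. The sketch below composes log loss with the logistic inverse link, which should reproduce the logistic margin loss $\log(1 + e^{-yv})$; the helper names are illustrative.

```python
import math


def log_loss(y, p):
    """CPE log loss for a label y in {-1, 1} and a probability estimate p in (0, 1)."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)


def logit(p):
    """Logistic link psi: (0, 1) -> R."""
    return math.log(p / (1.0 - p))


def inv_logit(v):
    """Inverse logistic link psi^{-1}: R -> (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def composite(cpe_loss, inverse_link):
    """Return the composite loss (y, v) -> cpe_loss(y, inverse_link(v))."""
    return lambda y, v: cpe_loss(y, inverse_link(v))


logistic_loss = composite(log_loss, inv_logit)

v = 0.7
print(logistic_loss(1, v), math.log(1.0 + math.exp(-v)))
print(logistic_loss(-1, v), math.log(1.0 + math.exp(v)))
```

Because the composite depends on $y$ and $v$ only through the product $yv$ here, this particular pairing of loss and link yields a margin loss.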

Often one will choose the loss first (tailoring its properties by the weighting given
according to $w(c)$), and then choose the link somewhat arbitrarily to map the hypotheses
appropriately. An interesting alternative perspective arises in the literature on
"elicitability". Lambert et al. [28] provide a general characterisation of proper scoring rules
(i.e. losses) for general properties of distributions, that is, continuous and locally
non-constant functions $\Gamma$ which assign a real value to each distribution over a finite sample
space. In the binary case, these properties provide another interpretation of links that is
complementary to the usual one that treats the inverse link $\psi^{-1}$ as a way of interpreting
scores as class probabilities.

To see this, we first identify distributions over $\{-1, 1\}$ with the probability $\eta$ of
observing 1. In this case properties are continuous, locally non-constant maps $\Gamma : [0, 1] \to
\mathbb{R}$. When a link function $\psi$ is continuous it can therefore be interpreted as a property
since its assumed invertibility implies it is locally non-constant. A property $\Gamma$ is said to
be elicitable whenever there exists a strictly proper loss $\ell$ for it so that the composite
loss $\ell^\Gamma$ satisfies for all $\hat\eta \ne \eta$

$$L^\Gamma(\eta, \hat\eta) := \mathbb{E}_{\mathsf{Y} \sim \eta}[\ell^\Gamma(\mathsf{Y}, \hat\eta)] > L^\Gamma(\eta, \eta).$$

See also [15].

Theorem 1 of [28] shows that $\Gamma$ is elicitable if and only if $\Gamma^{-1}(r)$ is convex for all
$r \in \mathrm{range}(\Gamma)$. This immediately gives us a characterisation of "proper" link functions:
those that are both continuous and have convex level sets in $[0, 1]$; they are the
non-decreasing continuous functions. Thus in Lambert's perspective, one chooses a
"property" first (i.e. the invertible link) and then chooses the proper loss.



4.1 Proper Composite Losses 



We will call a composite loss $\ell^\psi$ (19) a proper composite loss if $\ell$ in (19) is a proper
loss for class probability estimation. As in the case for losses for probability estimation, 
the requirement that a composite loss be proper imposes some constraints on its partial 
losses. Many of the results for proper losses carry over to composite losses with some 
extra factors to account for the link function. 



Theorem 10 Let $\lambda = \ell^\psi$ be a composite loss with differentiable and strictly monotone
link $\psi$ and suppose the partial losses $\lambda_{-1}(v)$ and $\lambda_1(v)$ are both differentiable. Then $\lambda$
is a proper composite loss if and only if there exists a weight function $w : (0, 1) \to \mathbb{R}^+$
such that for all $\hat\eta \in (0, 1)$

$$\frac{-\lambda_1'(\psi(\hat\eta))}{1 - \hat\eta} = \frac{\lambda_{-1}'(\psi(\hat\eta))}{\hat\eta} = \frac{w(\hat\eta)}{\psi'(\hat\eta)} =: \rho(\hat\eta) \tag{20}$$

where equality is in the $L_1$ sense. Furthermore, $\rho(\hat\eta) \ge 0$ for all $\hat\eta \in (0, 1)$.

Proof This is a direct consequence of Theorem 1 for proper losses for probability
estimation and the chain rule applied to $\ell_y(\hat\eta) = \lambda_y(\psi(\hat\eta))$. Since $\psi$ is assumed to be strictly
monotonic we know $\psi' > 0$ and so, since $w \ge 0$ we have $\rho \ge 0$. ■

As we shall see, the ratio $\rho(\hat\eta)$ is a key quantity in the analysis of proper composite
losses. For example, Corollary 2 has a natural analogue in terms of $\rho$ that will be of use
later. It is obtained by letting $\hat\eta = \psi^{-1}(v)$ and using the chain rule.

Corollary 11 Suppose $\ell^\psi$ is a proper composite loss with conditional risk denoted $L^\psi$.
Then

$$\frac{\partial}{\partial v} L^\psi(\eta, v) = (\psi^{-1}(v) - \eta)\,\rho(\psi^{-1}(v)). \tag{21}$$

Loosely speaking then, $\rho$ is a "co-ordinate free" weight function for composite losses
where the inverse link $\psi^{-1}$ is interpreted as a mapping from arbitrary $v \in \mathcal{V}$ to values
which can be interpreted as probabilities.



Another immediate corollary of Theorem 10 shows how properness is characterised
by a particular relationship between the choice of link function and the choice of partial
composite losses.

Corollary 12 Let $\lambda := \ell^\psi$ be a composite loss with differentiable partial losses $\lambda_1$ and
$\lambda_{-1}$. Then $\ell^\psi$ is proper if and only if the link $\psi$ satisfies

$$\psi^{-1}(v) = \frac{\lambda_{-1}'(v)}{\lambda_{-1}'(v) - \lambda_1'(v)}, \qquad \forall v \in \mathcal{V}. \tag{22}$$

Proof Substituting $\hat\eta = \psi^{-1}(v)$ into (20) yields $-\psi^{-1}(v)\,\lambda_1'(v) = (1 - \psi^{-1}(v))\,\lambda_{-1}'(v)$
and solving this for $\psi^{-1}(v)$ gives the result. ■



These results give some insight into the "degrees of freedom" available when specifying
proper composite losses. Theorem 10 shows that the partial losses are completely
determined once the weight function $w$ and $\psi$ (up to an additive constant) are fixed.
Corollary 12 shows that for a given link $\psi$ one can specify one of the partial losses $\lambda_y$
but then properness fixes the other partial loss $\lambda_{-y}$. Similarly, given an arbitrary choice
of the partial losses, equation (22) gives the single link which will guarantee the overall
loss is proper.



We see then that Corollary 12 provides us with a way of constructing a reference link
for arbitrary composite losses specified by their partial losses. The reference link can be
seen to satisfy

$$\psi(\eta) = \arg\min_{v \in \mathbb{R}} L^\psi(\eta, v)$$

for $\eta \in (0, 1)$ and thus calibrates a given composite loss in the sense of [10].
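The reference link can be computed numerically from a pair of partial losses. The sketch below assumes the relation $\psi^{-1}(v) = \lambda_{-1}'(v) / (\lambda_{-1}'(v) - \lambda_1'(v))$, obtained by solving the identity in the proof of Corollary 12 for $\psi^{-1}(v)$, and approximates the derivatives with central differences (an illustrative choice). It is applied to the logistic loss, whose reference inverse link should be the sigmoid.

```python
import math


def derivative(f, v, h=1e-5):
    """Central finite-difference approximation of f'(v)."""
    return (f(v + h) - f(v - h)) / (2.0 * h)


def reference_inverse_link(lam_minus, lam_plus):
    """Inverse link psi^{-1} that makes the composite loss with partial
    losses (lam_minus, lam_plus) proper, where the denominator is non-zero."""
    def inv_link(v):
        d_minus = derivative(lam_minus, v)
        d_plus = derivative(lam_plus, v)
        return d_minus / (d_minus - d_plus)

    return inv_link


# Partial losses of the logistic loss: lambda_y(v) = log(1 + exp(-y * v)).
lam_plus = lambda v: math.log1p(math.exp(-v))
lam_minus = lambda v: math.log1p(math.exp(v))

inv_link = reference_inverse_link(lam_minus, lam_plus)
v = 0.4
print(inv_link(v), 1.0 / (1.0 + math.exp(-v)))
```

Replacing the finite differences with analytic derivatives would remove the approximation error; they are used here only to keep the sketch generic.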

We now briefly consider an application of the parametrisation of proper losses as a
weight function and link. In order to implement Stochastic Gradient Descent (SGD)
algorithms one needs to compute the derivative of the loss with respect to predictions
$v \in \mathbb{R}$. Letting $\hat\eta(v) = \psi^{-1}(v)$ be the probability estimates associated with the prediction
$v$, we can use (21) when $\eta \in \{0, 1\}$ to obtain the update rules for positive and negative
examples:

$$\frac{\partial}{\partial v} \ell_1^\psi(v) = (\hat\eta(v) - 1)\,\rho(\hat\eta(v)), \tag{23}$$

$$\frac{\partial}{\partial v} \ell_{-1}^\psi(v) = \hat\eta(v)\,\rho(\hat\eta(v)). \tag{24}$$

Given an arbitrary weight function $w$ (which defines a proper loss via Corollary 2 and
Theorem 4) and link $\psi$, the above equations show that one could implement SGD directly
parametrised in terms of $\rho$ without needing to explicitly compute the partial losses
themselves.
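The update rules can be sketched in code without ever writing down the partial losses. The example below assumes $\rho(\hat\eta) = w(\hat\eta)/\psi'(\hat\eta)$ and the gradients $(\hat\eta(v) - 1)\rho(\hat\eta(v))$ for positive examples and $\hat\eta(v)\rho(\hat\eta(v))$ for negative ones; the linear model, learning rate and function names are illustrative assumptions. With the log loss weight and the logistic link, $\rho \equiv 1$ and the familiar logistic regression gradient is recovered.

```python
import math


def make_gradient(w, link_derivative, inverse_link):
    """Return grad(y, v), the derivative of the composite loss in v,
    built only from the weight function and the link."""
    def rho(p):
        return w(p) / link_derivative(p)

    def grad(y, v):
        p = inverse_link(v)
        return (p - 1.0) * rho(p) if y == 1 else p * rho(p)

    return grad


def sgd_step(weights, x, y, grad, learning_rate):
    """One SGD update for a linear predictor v = <weights, x>."""
    v = sum(wi * xi for wi, xi in zip(weights, x))
    g = grad(y, v)
    return [wi - learning_rate * g * xi for wi, xi in zip(weights, x)]


sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))

grad = make_gradient(
    w=lambda c: 1.0 / (c * (1.0 - c)),                # log loss weight
    link_derivative=lambda p: 1.0 / (p * (1.0 - p)),  # derivative of the logit link
    inverse_link=sigmoid,
)

print(grad(1, 0.3), sigmoid(0.3) - 1.0)
print(sgd_step([0.0, 0.0], [1.0, 2.0], 1, grad, 0.1))
```

Swapping in a different weight function or link changes the update without any other modification to the training loop.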

Finally, we make a note of an analogue of Corollary 5 for composite losses. It shows
that the regret for an arbitrary composite loss is related to a Bregman divergence via
its link.

Corollary 13 Let $\ell^\psi$ be a proper composite loss with invertible link. Then for all $\eta, \hat\eta \in
(0, 1)$,

$$\Delta L^\psi(\eta, v) = D_{-\underline{L}}(\eta, \psi^{-1}(v)). \tag{25}$$

This corollary generalises the results due to Zhang [50] and Masnadi-Shirazi and
Vasconcelos [32] who considered only margin losses respectively without and with links.



4.2 Margin Losses 

The margin associated with a real-valued prediction $v \in \mathbb{R}$ and label $y \in \{-1, 1\}$ is the
product $z = yv$. Any function $\phi : \mathbb{R} \to [0, \infty)$ can be used as a margin loss by interpreting
$\phi(yv)$ as the penalty for predicting $v$ for an instance with label $y$. Margin losses are
inherently symmetric since $yv = (-y)(-v)$ and so the penalty $\phi(yv)$ given for predicting
$v$ when the label is $y$ is necessarily the same as the penalty for predicting $-v$ when the
label is —y. Margin losses have attracted a lot of attention [5] because of their central 
role in Support Vector Machines ITT\ . In this section we explore the relationship between 
these margin losses and the more general class of composite losses and, in particular, 
symmetric composite losses. 

Recall that a general composite loss is of the form $\ell^\psi(y, v) = \ell(y, \psi^{-1}(v))$ for a loss
$\ell : \mathcal{Y} \times [0, 1] \to [0, \infty)$ and an invertible link $\psi : [0, 1] \to \mathbb{R}$. We would like to understand
when margin losses can be understood as losses suitable for probability estimation tasks.
As discussed above, proper losses are a natural class of losses over $[0, 1]$ for probability
estimation so a natural question in this vein is the following: given a margin loss $\phi$ can
we choose a link $\psi$ so that there exists a proper loss $\ell$ such that $\phi(yv) = \ell^\psi(y, v)$? In
this case the proper loss will be $\ell(y, \hat\eta) = \phi(y\,\psi(\hat\eta))$.

The following corollary of Theorem 10 gives necessary and sufficient conditions on
the choice of link $\psi$ to guarantee when a margin loss $\phi$ can be expressed as a proper
composite loss.

Corollary 14 Suppose $\phi : \mathbb{R} \to \mathbb{R}$ is a differentiable margin loss. Then, $\phi(yv)$ can be
expressed as a proper composite loss $\ell^\psi(y, v)$ if and only if the link $\psi$ satisfies

$$\psi^{-1}(v) = \frac{\phi'(-v)}{\phi'(-v) + \phi'(v)}. \tag{26}$$

Proof Margin losses, by definition, have partial losses $\lambda_y(v) = \phi(yv)$ which means
$\lambda_1'(v) = \phi'(v)$ and $\lambda_{-1}'(v) = -\phi'(-v)$. Substituting these into (22) gives the result. ■



This result provides a way of interpreting predictions $v$ as probabilities $\hat\eta = \psi^{-1}(v)$
in a consistent manner, for a problem defined by a margin loss. Conversely, it also
guarantees that using any other link to interpret predictions as probabilities will be
inconsistent.^8 Another immediate implication is that for a margin loss to be considered
a proper loss its link function must be symmetric in the sense that

$$\psi^{-1}(-v) = \frac{\phi'(v)}{\phi'(v) + \phi'(-v)} = 1 - \frac{\phi'(-v)}{\phi'(-v) + \phi'(v)} = 1 - \psi^{-1}(v),$$



^8 Strictly speaking, if the margin loss has "flat spots" (i.e., where $\phi'(v) = 0$) then the choice of
link may not be unique.
and so, by letting $v = \psi(\hat\eta)$, we have $\psi(1 - \hat\eta) = -\psi(\hat\eta)$ and thus $\psi(\frac{1}{2}) = 0$.



Corollary 14 can also be seen as a simplified and generalised version of the argument
by Masnadi-Shirazi and Vasconcelos [32] that a concave minimal conditional risk function
and a symmetric link completely determines a margin loss.^9

We now consider a couple of specific margin losses and show how they can be associated
with a proper loss through the choice of link given in Corollary 14. The exponential
loss $\phi(v) = e^{-v}$ gives rise to a proper loss $\ell(y, \hat\eta) = \phi(y\,\psi(\hat\eta))$ via the link

$$\psi^{-1}(v) = \frac{1}{1 + e^{-2v}}$$

which has non-zero denominator. In this case $\psi(\hat\eta) = \frac{1}{2}\log\frac{\hat\eta}{1 - \hat\eta}$ is just the logistic
link. Now consider the family of margin losses parametrised by $\alpha \in (0, \infty)$

$$\phi_\alpha(v) = \frac{\log(\exp((1 - v)\alpha) + 1)}{\alpha}.$$

This family of differentiable convex losses approximates the hinge loss as $\alpha \to \infty$ and
was studied in the multiclass case by Zhang et al. [51]. Since these are all differentiable
functions with $\phi_\alpha'(v) = \frac{-e^{\alpha(1 - v)}}{e^{\alpha(1 - v)} + 1}$, Corollary 14 and a little algebra gives

$$\psi^{-1}(v) = \left(1 + \frac{e^{2\alpha} + e^{\alpha(1 - v)}}{e^{2\alpha} + e^{\alpha(1 + v)}}\right)^{-1}.$$

Examining this family of inverse links as $\alpha \to \infty$ gives some insight into why the hinge loss
is a surrogate for classification but not probability estimation. When $\alpha \approx \infty$ an estimate
$\hat\eta = \psi^{-1}(v) \approx \frac{1}{2}$ for all but very large $v \in \mathbb{R}$. That is, in the limit all probability
estimates sit infinitesimally to the right or left of $\frac{1}{2}$ depending on the sign of $v$.
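The link associated with a margin loss can be computed directly from its derivative. The sketch below assumes the relation $\psi^{-1}(v) = \phi'(-v) / (\phi'(-v) + \phi'(v))$ for a differentiable margin loss $\phi$ and applies it to the exponential loss and to the smoothed hinge family $\phi_\alpha$; the function names and the chosen values of $\alpha$ and $v$ are illustrative.

```python
import math


def margin_inverse_link(dphi):
    """Inverse link psi^{-1}(v) = phi'(-v) / (phi'(-v) + phi'(v))."""
    return lambda v: dphi(-v) / (dphi(-v) + dphi(v))


def exp_dphi(v):
    """Derivative of the exponential loss phi(v) = exp(-v)."""
    return -math.exp(-v)


def smooth_hinge_dphi(alpha):
    """Derivative of phi_alpha(v) = log(exp((1 - v) * alpha) + 1) / alpha,
    written as -1 / (1 + exp(-alpha * (1 - v))) to limit overflow."""
    return lambda v: -1.0 / (1.0 + math.exp(-alpha * (1.0 - v)))


exp_link = margin_inverse_link(exp_dphi)
print(exp_link(0.3), 1.0 / (1.0 + math.exp(-0.6)))

for alpha in (1.0, 5.0, 50.0):
    link = margin_inverse_link(smooth_hinge_dphi(alpha))
    # For predictions inside (-1, 1) the estimate approaches 1/2 as alpha grows.
    print(alpha, link(0.5))
```

The loop illustrates the flattening of the inverse link around $\frac{1}{2}$ as $\alpha$ increases, which is the behaviour described for the hinge limit.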



5 Classification Calibration and Proper Losses 

The notion of properness of a loss designed for class probability estimation is a natural 
one. If one is only interested in classification (rather than estimating probabilities) a 
weaker condition suffices. In this section we will relate the weaker condition to proper- 
ness. 

5.1 Classification Calibration for CPE Losses 

We begin by giving a definition of classification calibration for CPE losses (i.e., over the
unit interval $[0, 1]$) and relate it to composite losses via a link.

^9 Shen [44, Section 4.4] seems to have been the first to view margin losses from this more general
perspective.
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Definition 15 We say a CPE loss i is classification calibrated at c G (0, 1) and write 
I is CCc if the associated conditional risk L satisfies 



Vr?/c, L(r?)< inf L{rj,f,). (27) 

rj: (17— c)(r?— c)<0 

The expression constraining the infimum ensures that fj is on the opposite side of c to 
r], or fj = c. 

The condition $\mathrm{CC}_{\frac{1}{2}}$ is equivalent to what is called "classification calibrated" by Bartlett et al. [7] and "Fisher consistent for classification problems" by Lin [30] although their definitions were only for margin losses.

One might suspect that there is a connection between classification calibrated at c 
and standard Fisher consistency for class probability estimation losses. The following 
theorem, which captures the intuition behind the "probing" reduction [29], characterises 
the situation. 

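Theorem 16 A CPE loss $\ell$ is $\mathrm{CC}_c$ for all $c \in (0,1)$ if and only if $\ell$ is strictly proper.

Proof $L$ is $\mathrm{CC}_c$ for all $c \in (0,1)$ is equivalent to

\begin{align*}
& \forall c \in (0,1),\ \forall \eta \ne c \quad
\begin{cases}
\underline{L}(\eta) < \inf_{\hat\eta \ge c} L(\eta,\hat\eta), & \eta < c \\
\underline{L}(\eta) < \inf_{\hat\eta \le c} L(\eta,\hat\eta), & \eta > c
\end{cases} \\
\iff\ & \forall \eta \in (0,1),\ \forall c \ne \eta \quad
\begin{cases}
\forall c > \eta, & \underline{L}(\eta) < \inf_{\hat\eta \ge c} L(\eta,\hat\eta) \\
\forall c < \eta, & \underline{L}(\eta) < \inf_{\hat\eta \le c} L(\eta,\hat\eta)
\end{cases} \\
\iff\ & \forall \eta \in (0,1) \quad
\begin{cases}
\underline{L}(\eta) < \inf_{\hat\eta \ge c > \eta} L(\eta,\hat\eta) \\
\underline{L}(\eta) < \inf_{\hat\eta \le c < \eta} L(\eta,\hat\eta)
\end{cases} \\
\iff\ & \forall \eta \in (0,1), \quad \underline{L}(\eta) < \inf_{(\hat\eta > \eta) \text{ or } (\hat\eta < \eta)} L(\eta,\hat\eta) \\
\iff\ & \forall \eta \in (0,1), \quad \underline{L}(\eta) < \inf_{\hat\eta \ne \eta} L(\eta,\hat\eta)
\end{align*}

which means $L$ is strictly proper. ■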

The following theorem is a generalisation of the characterisation of $\mathrm{CC}_{\frac{1}{2}}$ for margin losses via $\phi'(0)$ due to Bartlett et al. [7].

Theorem 17 Suppose $\ell$ is a loss and suppose that $\ell'_1$ and $\ell'_{-1}$ exist everywhere. Then for any $c \in (0,1)$, $\ell$ is $\mathrm{CC}_c$ if and only if

\[ \ell'_{-1}(c) > 0 \ \text{ and } \ \ell'_1(c) < 0 \ \text{ and } \ c\,\ell'_1(c) + (1-c)\,\ell'_{-1}(c) = 0. \tag{28} \]



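Proof Since $\ell'_1$ and $\ell'_{-1}$ are assumed to exist everywhere,

\[ \frac{\partial}{\partial\hat\eta} L(\eta,\hat\eta) = \eta\,\ell'_1(\hat\eta) + (1-\eta)\,\ell'_{-1}(\hat\eta) \]

exists for all $\hat\eta$. $L$ is $\mathrm{CC}_c$ is equivalent to

\begin{align*}
& \left. \frac{\partial}{\partial\hat\eta} L(\eta,\hat\eta) \right|_{\hat\eta = c}
\begin{cases} > 0, & \eta < c \\ < 0, & \eta > c \end{cases} \\
\iff\ &
\begin{cases}
\forall \eta < c, & \eta\,\ell'_1(c) + (1-\eta)\,\ell'_{-1}(c) > 0 \\
\forall \eta > c, & \eta\,\ell'_1(c) + (1-\eta)\,\ell'_{-1}(c) < 0
\end{cases} \tag{29} \\
\iff\ &
\begin{cases}
c\,\ell'_1(c) + (1-c)\,\ell'_{-1}(c) = 0 \\
\text{and } \ell'_{-1}(c) > 0 \text{ and } \ell'_1(c) < 0,
\end{cases} \tag{30}
\end{align*}

where we have used the fact that (29) with $\eta = 0$ and $\eta = 1$ respectively substituted implies $\ell'_{-1}(c) > 0$ and $\ell'_1(c) < 0$. ■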


If $\ell$ is proper, then by evaluating the derivative $\frac{\partial}{\partial\hat\eta} L(\eta,\hat\eta) = (\hat\eta - \eta)\,w(\hat\eta)$ at $\eta = 0$ and $\eta = 1$ we obtain $\ell'_1(\hat\eta) = -w(\hat\eta)(1 - \hat\eta)$ and $\ell'_{-1}(\hat\eta) = w(\hat\eta)\hat\eta$. Thus (30) implies $-w(c)(1-c) < 0$ and $w(c)c > 0$ which holds if and only if $w(c) \ne 0$. We have thus shown the following corollary.

Corollary 18 If $\ell$ is proper with weight $w$, then for any $c \in (0,1)$,

\[ w(c) \ne 0 \iff \ell \text{ is } \mathrm{CC}_c. \]

The simple form of the weight function for the cost-sensitive misclassification loss $\ell_{c_0}$ ($w(c) = \delta(c - c_0)$) gives the following corollary (confer Bartlett et al. [7]):

Corollary 19 $\ell_{c_0}$ is $\mathrm{CC}_c$ if and only if $c_0 = c$.
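Condition (28) and Corollary 18 are easy to check mechanically once the weight function is known. The sketch below is illustrative only: the helper names are assumptions, the log-loss weight $w(c) = 1/(c(1-c))$ is a standard example, and `vanishing_weight` is a hypothetical weight chosen because it is zero at $c = \frac{1}{2}$.

```python
def partial_derivatives(w, c):
    """Derivatives of the partial losses of a proper loss with weight w:
    l_1'(c) = -w(c)(1 - c) and l_{-1}'(c) = w(c) c."""
    return -w(c) * (1.0 - c), w(c) * c

def is_cc(w, c, tol=1e-12):
    """Check condition (28) at a point c in (0, 1)."""
    d_pos, d_neg = partial_derivatives(w, c)
    stationarity = c * d_pos + (1.0 - c) * d_neg
    return d_neg > 0.0 and d_pos < 0.0 and abs(stationarity) <= tol * max(1.0, abs(w(c)))

def log_loss_weight(c):
    return 1.0 / (c * (1.0 - c))

def vanishing_weight(c):
    # Hypothetical weight that is zero at c = 1/2 (not taken from the paper).
    return (c - 0.5) ** 2

print(is_cc(log_loss_weight, 0.3))   # weight is non-zero at 0.3
print(is_cc(vanishing_weight, 0.5))  # weight is zero at 0.5
```

The stationarity term is compared against a tolerance rather than exact zero because the two products are rounded independently in floating point.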
5.2 Calibration for Composite Losses 

The translation of the above results to general proper composite losses with invertible differentiable link is straightforward. Condition (27) becomes

\[ \forall \eta \ne c, \quad \underline{L}^\psi(\eta) < \inf_{v \colon (\psi^{-1}(v) - c)(\eta - c) \le 0} L(\eta, \psi^{-1}(v)). \]

Theorem 16 then immediately gives:

Corollary 20 A composite loss $\ell^\psi(\cdot,\cdot) = \ell(\cdot, \psi^{-1}(\cdot))$ with invertible and differentiable link $\psi$ is $\mathrm{CC}_c$ for all $c \in (0,1)$ if and only if the associated proper loss $\ell$ is strictly proper.

Theorem 17 immediately gives:

Corollary 21 Suppose $\ell^\psi$ is as in Corollary 20 and that the partial losses $\ell_1$ and $\ell_{-1}$ of the associated proper loss $\ell$ are differentiable. Then for any $c \in (0,1)$, $\ell^\psi$ is $\mathrm{CC}_c$ if and only if (28) holds.

It can be shown that in the special case of margin losses $L_\phi$ which satisfy the conditions of Corollary 14 such that they are proper composite losses, Corollary 21 leads to the condition $\phi'(0) < 0$ which is the same as obtained by Bartlett et al. [7].

6 Convexity of Composite Losses

We have seen that composite losses are defined by the proper loss $\ell$ and the link $\psi$. We have further seen from (14) that it is natural to parametrise composite losses in terms of $w$ and $\psi'$, and combine them as $\rho$. One may wish to choose a weight function $w$ and determine which links $\psi$ lead to a convex loss; or choose a link $\psi$ and determine which weight functions $w$ (and hence proper losses) lead to a convex composite loss. The main result of this section, Theorem 29, answers these questions by characterising the convexity of composite losses in terms of $(w, \psi')$ or $\rho$.

We first establish some convexity results for losses and their conditional and full 
risks. 

Lemma 22 Let $\ell \colon \mathcal{Y} \times \mathcal{V} \to [0,\infty)$ denote an arbitrary loss. Then the following are equivalent:

1. $v \mapsto \ell(y,v)$ is convex for all $y \in \{-1,1\}$,

2. $v \mapsto L(\eta,v)$ is convex for all $\eta \in [0,1]$,

3. $v \mapsto \hat{\mathbb{L}}(v,S) := \frac{1}{|S|}\sum_{(x,y)\in S} \ell(y, v(x))$ is convex for all finite $S \subset \mathcal{X} \times \mathcal{Y}$.

Proof 1 ⇒ 2: By definition, $L(\eta,v) = (1-\eta)\ell(-1,v) + \eta\,\ell(1,v)$ which is just a convex combination of convex functions and hence convex.

2 ⇒ 1: Choose $\eta = 0$ and $\eta = 1$ in the definition of $L$.

1 ⇒ 3: For a fixed $(x,y)$, the function $v \mapsto \ell(y, v(x))$ is convex since $\ell$ is convex. Thus, $\hat{\mathbb{L}}$ is convex as it is a non-negative weighted sum of convex functions.

3 ⇒ 1: The convexity of $\hat{\mathbb{L}}$ holds for every $S$ so for each $y \in \{-1,1\}$ choose $S = \{(x,y)\}$ for some $x$. In each case $v \mapsto \hat{\mathbb{L}}(v,S) = \ell(y, v(x))$ is convex as required. ■

The following theorem generalises the corollary on page 12 of Buja et al. [9] to arbitrary composite losses with invertible links. It has less practical value than the previous lemma since, in general, sums of quasi-convex functions are not necessarily quasi-convex (a function $f$ is quasi-convex if the set $\{x \colon f(x) \le \alpha\}$ is convex for all $\alpha \in \mathbb{R}$). Thus, assuming properness of the loss $\ell$ does not guarantee that its empirical risk $\hat{\mathbb{L}}(\cdot, S)$ will not have local minima.

Theorem 23 If $\ell^\psi(y,v) = \ell(y, \psi^{-1}(v))$ is a composite loss where $\ell$ is proper and $\psi$ is invertible and differentiable then $L^\psi(\eta,v)$ is quasi-convex in $v$ for all $\eta \in [0,1]$.

Proof Since $\ell$ is proper we know by Corollary 11 that the conditional risk satisfies

\[ \frac{\partial}{\partial v} L^\psi(\eta,v) = (\psi^{-1}(v) - \eta)\,\rho(\psi^{-1}(v)). \]

Since $\psi$ is invertible and $\rho \ge 0$ we see that $\frac{\partial}{\partial v} L^\psi(\eta,v)$ only changes sign at $\eta = \psi^{-1}(v)$ and so is quasi-convex as required. ■

The following theorem characterises convexity of composite losses with invertible links.

Theorem 24 Let $\ell^\psi(y,v)$ be a composite loss comprising an invertible link $\psi$ with inverse $q := \psi^{-1}$ and strictly proper loss with weight function $w$. Assume $q'(\cdot) > 0$. Then $v \mapsto \ell^\psi(y,v)$ is convex for $y \in \{-1,1\}$ if and only if

\[ -\frac{1}{x} \le \frac{w'(x)}{w(x)} - \frac{\psi''(x)}{\psi'(x)} \le \frac{1}{1-x}, \quad \forall x \in (0,1). \tag{31} \]

This theorem suggests a very natural parametrisation of composite losses is via $(w, \psi')$. Observe that $w, \psi' \colon [0,1] \to \mathbb{R}_+$. (But also see the comment following Theorem 29.)
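Proof We can write the conditional composite loss as

\[ L^\psi(\eta,v) = \eta\,\ell_1(q(v)) + (1-\eta)\,\ell_{-1}(q(v)) \]

and by substituting $q = \psi^{-1}$ into (21) we have

\[ \frac{\partial}{\partial v} L^\psi(\eta,v) = w(q(v))\,q'(v)\,[q(v) - \eta]. \tag{32} \]

A necessary and sufficient condition for $v \mapsto \ell^\psi(y,v) = L^\psi(y,v)$ to be convex for $y \in \{-1,1\}$ is that

\[ \frac{\partial^2}{\partial v^2} L^\psi(y,v) \ge 0, \quad \forall v \in \mathbb{R},\ \forall y \in \{-1,1\}. \tag{33} \]

Using (32) the above condition is equivalent to

\[ [w(q(v))q'(v)]'\,(q(v) - [\![y = 1]\!]) + w(q(v))\,q'(v)\,q'(v) \ge 0, \quad \forall v \in \mathbb{R}, \]

where

\[ [w(q(v))q'(v)]' := \frac{\partial}{\partial v}\, w(q(v))\,q'(v). \]

Inequality (33) is equivalent to [9, equation 39]. By further manipulations, we can simplify (33) considerably.

Since $[\![y = 1]\!]$ is either 0 or 1 we equivalently have the two inequalities

\begin{align*}
[w(q(v))q'(v)]'\,q(v) + w(q(v))\,(q'(v))^2 &\ge 0, \quad \forall v \in \mathbb{R} \quad (y = -1) \\
[w(q(v))q'(v)]'\,(q(v) - 1) + w(q(v))\,(q'(v))^2 &\ge 0, \quad \forall v \in \mathbb{R} \quad (y = 1)
\end{align*}

which we shall rewrite as the pair of inequalities

\begin{align*}
w(q(v))\,(q'(v))^2 &\ge -q(v)\,[w(q(v))q'(v)]', \quad \forall v \in \mathbb{R} \tag{34} \\
w(q(v))\,(q'(v))^2 &\ge (1 - q(v))\,[w(q(v))q'(v)]', \quad \forall v \in \mathbb{R}. \tag{35}
\end{align*}

Observe that if $q(\cdot) = 0$ (resp. $1 - q(\cdot) = 0$) then (34) (resp. (35)) is satisfied anyway because of the assumption on $q'$ and the fact that $w$ is non-negative. It is thus equivalent to restrict consideration to $v$ in the set

\[ \{x \colon q(x) \ne 0 \text{ and } (1 - q(x)) \ne 0\} = q^{-1}((0,1)) = \psi((0,1)). \]

Combining (34) and (35) we obtain the equivalent condition

\[ \frac{(q'(v))^2}{1 - q(v)} \ge \frac{[w(q(v))q'(v)]'}{w(q(v))} \ge \frac{-(q'(v))^2}{q(v)}, \quad \forall v \in \psi((0,1)), \tag{36} \]

where we have used the fact that $q \colon \mathbb{R} \to [0,1]$ and is thus sign-definite and consequently $-q(\cdot)$ is always negative and division by $q(v)$ and $1 - q(v)$ is permissible since as argued we can neglect the cases when these take on the value zero, and division by $w(q(v))$ is permissible by the assumption of strict properness since that implies $w(\cdot) > 0$. Now

\[ [w(q(\cdot))q'(\cdot)]' = w'(q(\cdot))\,q'(\cdot)\,q'(\cdot) + w(q(\cdot))\,q''(\cdot) \]

and thus (36) is equivalent to

\[ \frac{(q'(v))^2}{1 - q(v)} \ge \frac{w'(q(v))\,(q'(v))^2 + w(q(v))\,q''(v)}{w(q(v))} \ge \frac{-(q'(v))^2}{q(v)}, \quad \forall v \in \psi((0,1)). \tag{37} \]

Now divide all sides of (37) by $(q'(\cdot))^2$ (which is permissible by assumption). This gives the equivalent condition

\[ \frac{1}{1 - q(v)} \ge \frac{w'(q(v))}{w(q(v))} + \frac{q''(v)}{(q'(v))^2} \ge \frac{-1}{q(v)}, \quad \forall v \in \psi((0,1)). \tag{38} \]

Let $x = q(v)$ and so $v = q^{-1}(x) = \psi(x)$. Then (38) is equivalent to

\[ \frac{1}{1 - x} \ge \frac{w'(x)}{w(x)} + \frac{q''(\psi(x))}{(q'(\psi(x)))^2} \ge \frac{-1}{x}, \quad \forall x \in (0,1). \tag{39} \]

Now $\frac{1}{q'(\psi(x))} = \frac{1}{q'(q^{-1}(x))} = (q^{-1})'(x) = \psi'(x)$. Thus (39) is equivalent to

\[ \frac{1}{1 - x} \ge \frac{w'(x)}{w(x)} + \Phi_\psi(x) \ge \frac{-1}{x}, \quad \forall x \in (0,1), \tag{40} \]

where

\[ \Phi_\psi(x) := q''(\psi(x))\,(\psi'(x))^2. \tag{41} \]

All of the above steps are equivalences. We have thus shown that

\[ \text{(40) is true} \iff v \mapsto L^\psi(y,v) \text{ is convex for } y \in \{-1,1\} \]

where the right hand side is equivalent to the assertion in the theorem by Lemma 22.

Finally we simplify $\Phi_\psi$. We first compute $q''$ in terms of $\psi = q^{-1}$. Observe that $q' = (\psi^{-1})' = \frac{1}{\psi'(\psi^{-1}(\cdot))}$. Thus

\[ q''(\cdot) = \left( \frac{1}{\psi'(\psi^{-1}(\cdot))} \right)' = \frac{-1}{(\psi'(\psi^{-1}(\cdot)))^2}\,\psi''(\psi^{-1}(\cdot))\,(\psi^{-1})'(\cdot) = \frac{-\psi''(\psi^{-1}(\cdot))}{(\psi'(\psi^{-1}(\cdot)))^3}. \]

Thus by substitution

\[ \Phi_\psi(\cdot) = \frac{-\psi''(\cdot)}{(\psi'(\cdot))^3}\,(\psi'(\cdot))^2 = -\frac{\psi''(\cdot)}{\psi'(\cdot)}. \tag{42} \]

Substituting the simpler expression (42) for $\Phi_\psi$ into (40) completes the proof. ■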



Lemma 25 If $\psi$ is affine then $\Phi_\psi = 0$.

Proof Using (42), this is immediate since in this case $\psi''(\cdot) = 0$. ■

Corollary 26 Composite losses with a linear link (including as a special case the identity link) are convex if and only if

\[ -\frac{1}{x} \le \frac{w'(x)}{w(x)} \le \frac{1}{1-x}, \quad \forall x \in (0,1). \]

6.1 Canonical Links 

Buja et al. [9] introduced the notion of a canonical link defined by $\psi'(v) = w(v)$. The canonical link corresponds to the notion of "matching loss" as developed by Helmbold et al. [20] and Kivinen and Warmuth [26]. Note that choice of canonical link implies $\rho(c) = w(c)/\psi'(c) = 1$.

Lemma 27 Suppose $\ell$ is a proper loss with weight function $w$ and $\psi$ is the corresponding canonical link, then

\[ \Phi_\psi(x) = -\frac{w'(x)}{w(x)}. \tag{43} \]

Proof Substitute $\psi' = w$ into (42). ■

This lemma gives an immediate proof of the following result due to Buja et al. [9].

Theorem 28 A composite loss comprising a proper loss with weight function $w$ combined with its canonical link is always convex.

Proof Substitute (43) into (31) to obtain

\[ -\frac{1}{x} \le 0 \le \frac{1}{1-x}, \quad \forall x \in (0,1) \]

which holds for any $w$. ■

An alternative view of canonical links is given in Appendix B.
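Condition (31) can be tested pointwise on a grid for a given weight and link. The following is a rough numerical sketch under stated assumptions: the function name `satisfies_31`, the grid, and the tolerance are choices made here, and the derivatives $w'$, $\psi'$ and $\psi''$ are supplied analytically by the caller. A grid check can only refute convexity or make it plausible; it is not a proof.

```python
def satisfies_31(w, dw, dpsi, d2psi, grid, tol=1e-9):
    """Check -1/x <= w'(x)/w(x) - psi''(x)/psi'(x) <= 1/(1-x) on a grid in (0, 1)."""
    for x in grid:
        middle = dw(x) / w(x) - d2psi(x) / dpsi(x)
        if middle < -1.0 / x - tol or middle > 1.0 / (1.0 - x) + tol:
            return False
    return True

grid = [i / 100.0 for i in range(1, 100)]

# Log loss: w(x) = 1 / (x (1 - x)).
def w_log(x):
    return 1.0 / (x * (1.0 - x))

def dw_log(x):
    return (2.0 * x - 1.0) / (x * (1.0 - x)) ** 2

# Squared loss: w(x) = 1.
def w_sq(x):
    return 1.0

def dw_sq(x):
    return 0.0

# Identity link: psi' = 1, psi'' = 0.
identity = (lambda x: 1.0, lambda x: 0.0)
# Logistic link psi(x) = log(x / (1 - x)): psi' and psi'' coincide with w_log and dw_log.
logit = (w_log, dw_log)

print(satisfies_31(w_log, dw_log, *identity, grid))  # log loss, identity link
print(satisfies_31(w_log, dw_log, *logit, grid))     # log loss with its canonical link
print(satisfies_31(w_sq, dw_sq, *logit, grid))       # squared loss with logistic link
```

With the canonical link the middle term vanishes identically, which is the content of Theorem 28; squared loss composed with the logistic link violates the bounds near the ends of the interval.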
6.2 A Simpler Characterisation of Convex Composite Losses

The following theorem provides a simpler characterisation of the convexity of composite losses. Noting that loss functions can be multiplied by a scalar without affecting what a learning algorithm will do, it is convenient to normalise them. If $w$ satisfies (31) then so does $\alpha w$ for all $\alpha \in (0,\infty)$. Thus without loss of generality we will normalise $w$ such that $w(\frac{1}{2}) = 1$. We chose to normalise about $\frac{1}{2}$ for two reasons: symmetry and the fact that $w$ can have non-integrable singularities at 0 and 1; see e.g. [9].

Theorem 29 Consider a proper composite loss $\ell^\psi$ with invertible link $\psi$ and (strictly proper) weight $w$ normalised such that $w(\frac{1}{2}) = 1$. Then $\ell^\psi$ is convex if and only if

\[ \frac{\psi'(x)}{2\psi'(\frac{1}{2})\,x} \ \lesseqgtr\ w(x) \ \lesseqgtr\ \frac{\psi'(x)}{2\psi'(\frac{1}{2})\,(1-x)}, \quad \forall x \in (0,1), \tag{44} \]

where $\lesseqgtr$ denotes $\le$ for $x \ge \frac{1}{2}$ and denotes $\ge$ for $x \le \frac{1}{2}$.

Observe that the condition (44) is equivalent to

\[ \frac{1}{2\psi'(\frac{1}{2})\,x} \ \lesseqgtr\ \rho(x) \ \lesseqgtr\ \frac{1}{2\psi'(\frac{1}{2})\,(1-x)}, \quad \forall x \in (0,1), \tag{45} \]

which suggests the importance of the function $\rho(\cdot)$.

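Proof Observing that $\frac{w'(x)}{w(x)} = (\log w)'(x)$ we let $g(x) := \log w(x)$. Observe that $g(v) = \int_{\frac{1}{2}}^{v} g'(x)\,dx + g(\frac{1}{2})$ and $g(\frac{1}{2}) = \log w(\frac{1}{2}) = 0$. Thus from (31) we obtain

\[ -\frac{1}{x} - \Phi_\psi(x) \le g'(x) \le \frac{1}{1-x} - \Phi_\psi(x). \]

For $v \ge \frac{1}{2}$ we thus have

\[ \int_{\frac{1}{2}}^{v} \left( -\frac{1}{x} - \Phi_\psi(x) \right) dx \le g(v) \le \int_{\frac{1}{2}}^{v} \left( \frac{1}{1-x} - \Phi_\psi(x) \right) dx. \]

Conversely, for $v \le \frac{1}{2}$ we have

\[ \int_{\frac{1}{2}}^{v} \left( -\frac{1}{x} - \Phi_\psi(x) \right) dx \ge g(v) \ge \int_{\frac{1}{2}}^{v} \left( \frac{1}{1-x} - \Phi_\psi(x) \right) dx, \]

and thus

\[ -\ln v - \ln 2 - \int_{\frac{1}{2}}^{v} \Phi_\psi(x)\,dx \ \lesseqgtr\ g(v) \ \lesseqgtr\ -\ln 2 - \ln(1-v) - \int_{\frac{1}{2}}^{v} \Phi_\psi(x)\,dx. \]

Since $\exp(\cdot)$ is monotone increasing we can apply it to all terms and obtain

\[ \frac{1}{2v} \exp\left( -\int_{\frac{1}{2}}^{v} \Phi_\psi(x)\,dx \right) \ \lesseqgtr\ w(v) \ \lesseqgtr\ \frac{1}{2(1-v)} \exp\left( -\int_{\frac{1}{2}}^{v} \Phi_\psi(x)\,dx \right). \tag{46} \]

Now

\[ \int_{\frac{1}{2}}^{v} \Phi_\psi(x)\,dx = -\int_{\frac{1}{2}}^{v} (\log \psi')'(x)\,dx = -\log \psi'(v) + \log \psi'(\tfrac{1}{2}) \]

and so

\[ \exp\left( -\int_{\frac{1}{2}}^{v} \Phi_\psi(x)\,dx \right) = \frac{\psi'(v)}{\psi'(\frac{1}{2})}. \]

Substituting into (46) completes the proof. ■

Figure 2: Allowable normalised weight functions to ensure convexity of composite loss functions with identity link (left) and logistic link (right).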
If $\psi$ is the identity (i.e. if $\ell^\psi$ is itself proper) we get the simpler constraints

\[ \frac{1}{2x} \ \lesseqgtr\ w(x) \ \lesseqgtr\ \frac{1}{2(1-x)}, \quad \forall x \in (0,1), \tag{47} \]

which are illustrated as the shaded region in Figure 2. Observe that the (normalised) weight function for squared loss is $w(c) = 1$ which is indeed within the shaded region as one would expect.

Consider the link $\psi^{\mathrm{logit}}(c) := \log\left(\frac{c}{1-c}\right)$ with corresponding inverse link $q(c) = \frac{1}{1 + e^{-c}}$. One can check that $\psi'(c) = \frac{1}{c(1-c)}$. Thus the constraints on the weight function $w$ to ensure convexity of the composite loss are

\[ \frac{1}{8x^2(1-x)} \ \lesseqgtr\ w(x) \ \lesseqgtr\ \frac{1}{8x(1-x)^2}, \quad \forall x \in (0,1). \]

This is shown graphically in Figure 2. One can compute similar regions for any link. Two other examples are the Complementary Log-Log link $\psi^{\mathrm{CLL}}(x) = \log(-\log(1-x))$ (confer McCullagh and Nelder [33]), the "square link" $\psi^{\mathrm{sq}}(x) = x^2$ and the "cosine link" $\psi^{\cos}(x) = 1 - \cos(\pi x)$. All of these are illustrated in Figure 3.

Figure 3: Allowable normalised weight functions to ensure convexity of loss functions with complementary log-log, square and cosine links.

The reason for considering these last two rather unusual links is to illustrate the following fact. Observing that the allowable region in Figure 2 precludes weight functions that approach zero at the endpoints of the interval, and noting that in order to well approximate the behaviour of 0-1 loss (with its weight function being $w_{0\text{-}1}(c) = \delta(c - \frac{1}{2})$) one would like a weight function that does indeed approach zero at the end points, it is natural to ask what constraints are imposed upon a link $\psi$ such that a composite loss with that link and a weight function $w(c)$ such that

\[ \lim_{c \searrow 0} w(c) = \lim_{c \nearrow 1} w(c) = 0 \tag{48} \]

is convex. Inspection of (44) reveals it is necessary that $\psi'(x) \to 0$ as $x \to 0$ and $x \to 1$. Such $\psi$ necessarily have bounded range and thus the inverse link $\psi^{-1}$ is only defined on a finite interval and furthermore the gradient of $\psi^{-1}$ will be arbitrarily large. If one wants inverse links defined on the whole real line (such as the logistic link) then one can not obtain a convex composite loss with the associated proper loss having a weight function satisfying (48). Thus one can not choose an effectively usable link to ensure convexity of a proper loss that is arbitrarily "close to" 0-1 loss in the sense of the corresponding weight functions.
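The allowable regions described by (44) are straightforward to evaluate for any link whose derivative is available. The sketch below is an illustration only: `weight_bounds` and `in_region` are names introduced here, and the region at each point is taken as the interval between the two expressions in (44), which matches the direction of the inequalities on either side of one half when $\psi' > 0$.

```python
def weight_bounds(dpsi, x):
    """Return (lower, upper) for w(x) from the two expressions in (44)."""
    scale = 2.0 * dpsi(0.5)
    a = dpsi(x) / (scale * x)          # left-hand expression in (44)
    b = dpsi(x) / (scale * (1.0 - x))  # right-hand expression in (44)
    return min(a, b), max(a, b)

def in_region(w, dpsi, grid, tol=1e-12):
    """Check that w stays between the two bounds at every grid point."""
    for x in grid:
        lower, upper = weight_bounds(dpsi, x)
        if not (lower - tol <= w(x) <= upper + tol):
            return False
    return True

grid = [i / 100.0 for i in range(1, 100)]

def dpsi_identity(x):
    return 1.0

def dpsi_logit(x):
    return 1.0 / (x * (1.0 - x))

def w_squared(x):
    return 1.0  # squared loss, already normalised so that w(1/2) = 1

def w_log_normalised(x):
    return 1.0 / (4.0 * x * (1.0 - x))  # log loss weight rescaled so that w(1/2) = 1

print(in_region(w_squared, dpsi_identity, grid))
print(in_region(w_log_normalised, dpsi_logit, grid))
print(in_region(w_squared, dpsi_logit, grid))
```

Swapping in another derivative, such as that of the complementary log-log link, gives the corresponding region without further changes.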

Corollary 30 If a loss is proper and convex, then it is strictly proper.

The proof of Corollary 30 makes use of the following special case of the Gronwall style Lemma 1.1.1 of Bainov and Simeonov [3].

Lemma 31 Let $b \colon \mathbb{R} \to \mathbb{R}$ be continuous for $t \ge \alpha$. Let $v(t)$ be differentiable for $t \ge \alpha$ and suppose $v'(t) \le b(t)v(t)$ for $t \ge \alpha$ and $v(\alpha) \le v_0$. Then for $t \ge \alpha$,

\[ v(t) \le v_0 \exp\left( \int_{\alpha}^{t} b(s)\,ds \right). \]

Proof (Corollary 30) Observe that the RHS of (31) implies

\[ w'(v) \le \frac{w(v)}{1 - v}, \quad v \ge 0. \]

Suppose $w(0) = 0$. Then $v_0 = 0$ and setting $\alpha = 0$ the lemma implies

\[ w(t) \le v_0 \exp\left( \int_{0}^{t} \frac{1}{1-s}\,ds \right) = \frac{v_0}{1-t} = 0, \quad t \in (0,1]. \]

Thus if $w(0) = 0$ then $w(t) = 0$ for all $t \in (0,1)$. Choosing any other $\alpha \in (0,1)$ leads to a similar conclusion. Thus if $w(t) = 0$ for some $t \in [0,1)$, then $w(s) = 0$ for all $s \in [t,1]$. Hence $w(t) > 0$ for all $t \in [0,1]$ and hence by the remark immediately following Theorem 6, $\ell$ is strictly proper. ■



7 Choosing a Surrogate Loss 

A surrogate loss function is a loss function which is not exactly what one wishes to 
minimise but is easier to work with algorithmically. Convex surrogate losses are often 
used in place of the 0-1 loss which is not convex. 

Surrogate losses have garnered increasing interest in the machine learning commu- 
nity [501 13 113 EZ] • Some of the questions considered to date are bounding the regret 
of a desired loss in terms of a surrogate ("surrogate regret bounds" — see |39j and 
references therein), the relationship between the decision theoretic perspective and the 
elicitability perspective [32], and efficient algorithms for minimising convex surrogate 
margin losses |35l l3^ . 

Typically convex surrogates are used because they lead to convex, and thus tractable, 
optimisation problems. To date, work on surrogate losses has focussed on margin losses 
which necessarily are symmetric with respect to false positives and false negatives [9]. 
In line with the rest of this paper, our treatment will not be so restricted. 



7.1 The "Best" Surrogate Loss 

There are many choices of surrogate loss one can choose. A natural question is thus "which is best?". In order to answer this we need to first define how we are evaluating losses as surrogates. To do this we require notation to describe the set of minimisers of the conditional and full risk associated with a loss. Given a loss $\ell \colon \{-1,1\} \times \mathcal{V} \to \mathbb{R}$ its conditional minimisers at $\eta \in [0,1]$ is the set

\[ H(\ell,\eta) := \{ v \in \mathcal{V} \colon L(\eta,v) = \underline{L}(\eta) \}. \tag{49} \]

Given a set of hypotheses $\mathcal{H} \subseteq \mathcal{V}^{\mathcal{X}}$, the (constrained) Bayes optimal risk is

\[ \underline{\mathbb{L}}_{\mathcal{H}} := \inf_{h \in \mathcal{H}} \mathbb{L}(h, P). \]

The (full) minimisers over $\mathcal{H}$ for $P$ is the set

\[ \mathcal{H}(\ell,P) := \{ h \in \mathcal{H} \colon \mathbb{L}(h) = \underline{\mathbb{L}}_{\mathcal{H}} \}, \]

where $\mathcal{H} \subseteq \mathcal{V}^{\mathcal{X}}$ is some restricted set of functions and $\mathbb{L}(h) := \mathbb{E}_{(X,Y)\sim P}[\ell(Y, h(X))]$ and the expectation is with respect to $P$. Given a reference loss $\ell_{\mathrm{ref}}$, we will say the $\ell_{\mathrm{ref}}$-surrogate penalty of a loss $\ell$ over the function class $\mathcal{H}$ on a problem $(\eta, M)$ (or equivalently $P$) is

\[ S_{\ell_{\mathrm{ref}}}(\ell,\eta,M) = S_{\ell_{\mathrm{ref}}}(\ell,P) := \inf_{h \in \mathcal{H}(\ell,P)} \mathbb{L}_{\mathrm{ref}}(h, P), \]

where it is important to remember that $\mathbb{L}_{\mathrm{ref}}$ is with respect to $P$. That is, $S_{\ell_{\mathrm{ref}}}(\ell,P)$ is the minimum $\ell_{\mathrm{ref}}$ risk obtainable by a function in $\mathcal{H}$ that minimises the $\ell$ risk.

Given a fixed experiment $P$, if $\mathcal{L}$ is a class of losses then the best surrogate losses in $\mathcal{L}$ for the reference loss $\ell_{\mathrm{ref}}$ are those that minimise the $\ell_{\mathrm{ref}}$-surrogate penalty. This definition is motivated by the manner in which surrogate losses are used: one minimises $\mathbb{L}(h)$ over $h$ to obtain the minimiser $h^*$ and one hopes that $\mathbb{L}_{\mathrm{ref}}(h^*)$ is small. Clearly, if the class of losses contains the reference loss (i.e., $\ell_{\mathrm{ref}} \in \mathcal{L}$) then $\ell_{\mathrm{ref}}$ will be a best surrogate loss. Therefore, the question of best surrogate loss is only interesting when $\ell_{\mathrm{ref}} \notin \mathcal{L}$. One particular case we will consider is when the reference loss is the 0-1 loss and the class of surrogates $\mathcal{L}$ is the set of convex proper losses. Since 0-1 loss is not convex the question of which surrogate is best is non-trivial.

It would be nice if one could reason about the "best" surrogate loss using the conditional perspective (that is working with $L$ instead of $\mathbb{L}$) and in a manner independent of $\mathcal{H}$. It is simple to see why this can not be done. Since all the losses we consider are proper, the minimiser over $\hat\eta$ of $L(\eta,\hat\eta)$ is $\eta$. Thus any proper loss would lead to the same $\hat\eta \in [0,1]$. It is only the introduction of the restricted class of hypotheses $\mathcal{H}$ that prevents this reasoning being applied for $\mathbb{L}$: restrictions on $h \in \mathcal{H}$ prevent $h(x) = \eta(x)$ for all $x \in \mathcal{X}$. We conclude that the problem of best surrogate loss only makes sense when one both takes expectations over $\mathcal{X}$ and restricts the class of hypotheses $h$ to be drawn from some set $\mathcal{H} \subsetneq [0,1]^{\mathcal{X}}$.

This reasoning accords with that of Nock and Nielsen |351 |35j who examined which 
surrogate to use and proposed a data-dependent scheme that tunes surrogates for a 
problem. They explicitly considered proper losses and said that "minimizing any [lower- 
bounded, symmetric proper] loss amounts to the same ultimate goal" and concluded 
that "the crux of the choice of the [loss] relies on data-dependent considerations" . 

We demonstrate the difficulty of finding a universal best surrogate loss by constructing a simple example. One can construct experiments $(\eta_1, M)$ and $(\eta_2, M)$ and proper losses $\ell_1$ and $\ell_2$ such that

\[ S_{\ell_{0\text{-}1}}(\ell_1, (\eta_1, M)) > S_{\ell_{0\text{-}1}}(\ell_2, (\eta_1, M)) \quad \text{but} \quad S_{\ell_{0\text{-}1}}(\ell_1, (\eta_2, M)) < S_{\ell_{0\text{-}1}}(\ell_2, (\eta_2, M)). \]

(The examples we construct have weight functions that "cross-over" each other; the details are in Appendix A.) However, this does not imply there can not exist a particular convex $\ell^*$ that minorises all proper losses in this sense. Indeed, we conjecture that, in the sense described above, there is no best proper, convex surrogate loss.

Conjecture 32 Given a proper, convex loss $\ell$ there exists a second proper, convex loss $\ell^* \ne \ell$, a hypothesis class $\mathcal{H}$ and an experiment $P$ such that $S_{\ell_{0\text{-}1}}(\ell^*, P) < S_{\ell_{0\text{-}1}}(\ell, P)$ for the class $\mathcal{H}$.

To prove the above conjecture it would suffice to show that for a fixed hypothesis class and any pair of losses one can construct two experiments such that one loss minorises the other loss on one experiment and vice versa on the other experiment.

Supposing the above conjecture is true, one might then ask for a best surrogate loss for some reference loss $\ell_{\mathrm{ref}}$ in a minimax sense. Formally, we would like the loss $\ell^* \in \mathcal{L}$ such that the worst-case penalty for using $\ell^*$,

\[ T_{\mathcal{L}}(\ell^*) := \sup_{P} \left[ S_{\ell_{\mathrm{ref}}}(\ell^*, P) - \inf_{\ell \in \mathcal{L}} S_{\ell_{\mathrm{ref}}}(\ell, P) \right] \]

is minimised. That is, $T_{\mathcal{L}}(\ell^*) \le T_{\mathcal{L}}(\ell)$ for all $\ell \in \mathcal{L}$.



7.2 The "Minimal" Symmetric Convex Proper Loss 



Theorem 29 suggests an answer to the question "What is the proper convex loss closest to the 0-1 loss?" A way of making this question precise follows. Since $\ell$ is presumed proper, it has a weight function $w$. Suppose w.l.o.g. that $w(\frac{1}{2}) = 1$. Suppose the link is the identity. The constraints in (31) imply that the weight function that is most similar to that for 0-1 loss meets the constraints. Thus from (47)

\[ w_{\mathrm{minimal}}(c) = \frac{1}{2}\left( \frac{1}{c} \wedge \frac{1}{1-c} \right) \tag{50} \]

is the weight for the convex proper loss closest to 0-1 loss in this sense. It is the weight function that forms the lower envelope of the shaded region in the left diagram of Figure 2. Using (14) one can readily compute the corresponding partial losses explicitly

\[ \ell^{\mathrm{minimal}}_{-1}(\hat\eta) = \frac{1}{2}\left( [\![\hat\eta < \tfrac{1}{2}]\!]\,(-\hat\eta - \ln(1 - \hat\eta)) + [\![\hat\eta \ge \tfrac{1}{2}]\!]\,(\hat\eta - 1 - \ln(\tfrac{1}{2})) \right) \tag{51} \]

and

\[ \ell^{\mathrm{minimal}}_{1}(\hat\eta) = \frac{1}{2}\left( [\![\hat\eta < \tfrac{1}{2}]\!]\,(-\hat\eta - \ln(\tfrac{1}{2})) + [\![\hat\eta \ge \tfrac{1}{2}]\!]\,(\hat\eta - 1 - \ln(\hat\eta)) \right). \tag{52} \]

Observe that the partial losses are (in part) linear, which is unsurprising as linear functions are on the boundary of the set of convex functions. This loss is also best in another more precise (but ultimately unsatisfactory) sense, as we shall now show.
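The minimal weight (50) and the partial losses (51) and (52) can be written down directly and sanity-checked against the relations $\ell'_{-1}(\hat\eta) = w(\hat\eta)\hat\eta$ and $\ell'_1(\hat\eta) = -w(\hat\eta)(1-\hat\eta)$. This is a small sketch for illustration; the function names are not from the paper, and the checks use finite differences and a grid search rather than exact arguments.

```python
import math

def w_minimal(c):
    """Weight (50): half the smaller of 1/c and 1/(1-c)."""
    return 0.5 * min(1.0 / c, 1.0 / (1.0 - c))

def loss_neg(p):
    """Partial loss for y = -1, equation (51)."""
    if p < 0.5:
        return 0.5 * (-p - math.log(1.0 - p))
    return 0.5 * (p - 1.0 - math.log(0.5))

def loss_pos(p):
    """Partial loss for y = +1, equation (52)."""
    if p < 0.5:
        return 0.5 * (-p - math.log(0.5))
    return 0.5 * (p - 1.0 - math.log(p))

def conditional_risk(eta, p):
    """L(eta, p) = eta * l_1(p) + (1 - eta) * l_{-1}(p)."""
    return eta * loss_pos(p) + (1.0 - eta) * loss_neg(p)

# Properness check on a grid: the conditional risk should be minimised at p = eta.
grid = [i / 100.0 for i in range(1, 100)]
eta = 0.3
best = min(grid, key=lambda p: conditional_risk(eta, p))
print(best)
```

Each partial loss is linear on one half of the interval, matching the remark that the loss sits on the boundary of the convex set.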

Surrogate regret bounds are theoretical bounds on the regret of a desired loss (say 0-1 loss) in terms of the regret with respect to a surrogate. Reid and Williamson [39] have shown the following (we only quote the simpler symmetric case here):

Theorem 33 Suppose $\ell$ is a proper loss with corresponding conditional Bayes risk $\underline{L}$ which is symmetric about $\frac{1}{2}$: $\underline{L}(\frac{1}{2} - c) = \underline{L}(\frac{1}{2} + c)$ for $c \in [0, \frac{1}{2}]$. If the regret for the $\ell_{\frac{1}{2}}$ loss $\Delta L_{\frac{1}{2}}(\eta,\hat\eta) = \alpha$, then the regret $\Delta L$ with respect to $\ell$ satisfies

\[ \Delta L(\eta,\hat\eta) \ge \underline{L}(\tfrac{1}{2}) - \underline{L}(\tfrac{1}{2} + \alpha). \tag{53} \]

The bound in the theorem can be inverted to upper bound $\Delta L_{\frac{1}{2}}$ given an upper bound on $\Delta L(\eta,\hat\eta)$. Considering all symmetric proper losses normalised such that $w(\frac{1}{2}) = 1$, the right side of (53) is maximised and thus the bound on $\Delta L_{\frac{1}{2}}$ in terms of $\Delta L$ is minimised when $\underline{L}(\frac{1}{2} + \alpha)$ is maximised (over all losses normalised as mentioned). But since $w = -\underline{L}''$, that occurs for the pointwise minimiser of $w$ (subject to $w(\frac{1}{2}) = 1$). Since we are interested in convex losses, the minimising $w$ is given by (50). In this case the right hand side of (53) can be explicitly determined to be $\left(\frac{\alpha}{2} + \frac{1}{4}\right)\log(2\alpha + 1) - \frac{\alpha}{2}$ and the bound can be inverted to obtain the result that if $\Delta L^{\mathrm{minimal}}(\eta,\hat\eta) = x$ then

\[ \Delta L_{\frac{1}{2}}(\eta,\hat\eta) \le \frac{1}{2}\exp\left( \mathrm{LambertW}\left( \frac{4x - 1}{e} \right) + 1 \right) - \frac{1}{2}, \tag{54} \]

which is plotted in Figure 4.

Figure 4: Upper bound on the 0-1 regret in terms of $\Delta L^{\mathrm{minimal}}$ as given by (54).
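The inversion leading to (54) can be checked numerically by composing the forward bound with its claimed inverse. The sketch below makes two assumptions that are not in the paper: the Lambert W function is computed by plain bisection on $s e^s = z$ over $[-1, 0]$ (the only range needed here, since $x$ lies between $0$ and the value of the forward bound at $\alpha = \frac{1}{2}$), and the function names are illustrative.

```python
import math

def surrogate_regret_lower_bound(alpha):
    """Right-hand side of (53) for the minimal loss, for alpha in [0, 1/2]."""
    return (alpha / 2.0 + 0.25) * math.log(2.0 * alpha + 1.0) - alpha / 2.0

def lambert_w0(z, iters=200):
    """Principal branch of Lambert W for z in [-1/e, 0], by bisection.

    s * exp(s) is increasing on [-1, 0], so bisection on that interval suffices.
    """
    lo, hi = -1.0, 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid * math.exp(mid) < z:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def zero_one_regret_bound(x):
    """Equation (54): bound on the 0-1 regret when the minimal-loss regret is x."""
    return 0.5 * math.exp(lambert_w0((4.0 * x - 1.0) / math.e) + 1.0) - 0.5

# Round trip: alpha -> surrogate regret -> recovered alpha.
for alpha in (0.1, 0.25, 0.5):
    x = surrogate_regret_lower_bound(alpha)
    print(alpha, x, zero_one_regret_bound(x))
```

A library implementation of Lambert W could be substituted for the bisection; the bisection is used here only to keep the sketch dependency-free.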

The above argument does not show that the loss given by (51) and (52) is the best surrogate loss. Nevertheless it does suggest it is at least worth considering using $\ell^{\mathrm{minimal}}$ as a convex proper surrogate binary loss.



8 Conclusions 



Composite losses are widely used. In this paper we have characterised a number of 
aspects of them: their relationship to margin losses, the connection between properness 
and classification calibration, the constraints symmetry imposes, when composite losses 
are convex, and natural ways to parametrise them. We have also considered the question 
of the "best" surrogate loss. 
The parametrisation of a composite loss in terms of $(w, \psi')$ (or $\rho$) has advantages over using $(\phi, \psi)$ or $(\underline{L}, \psi)$. As explained by Masnadi-Shirazi and Vasconcelos [32], the representation in terms of $(\phi, \psi)$ is in general not unique. The representation in terms of $\underline{L}$ is harder to intuit: whilst indeed the Bayes risk for squared loss and 0-1 loss are "close" (compare the graph of $c \mapsto c(1-c)$ with that of $c \mapsto c \wedge (1-c)$), by examining their weight functions they are seen to be very different ($w(c) = 1$ versus $w(c) = 2\delta(c - \frac{1}{2})$).

We have also seen that on the basis of Theorem 24 the parametrisation $(w, \psi')$ is perhaps the most natural: there is a pleasing symmetry between the loss and the link as they are in this form both parametrised in terms of non-negative weight functions on [0, 1]. Recall too that the canonical link sets $\psi'$ equal to $w$.

The observation suggests an alternate inductive principle known as surrogate tuning, which seems to have been first suggested by Nock and Nielsen [35]. The idea of surrogate tuning is simple: noting that the best surrogate depends on the problem, adapt the surrogate you are using to the problem. In order to do so it is important to have a good parametrisation of the loss. The weight function perspective does just that, especially given Theorem 29. It would be straightforward to develop low dimensional parametrisations of $w$ that satisfy the conditions of this theorem which would thus allow a learning algorithm to explore the space of convex losses. One could (taking due care with the subsequent multiple hypothesis testing problem) regularly evaluate the 0-1 loss of the hypotheses so obtained. The observations made in Section 4 regarding stochastic gradient descent algorithms may be of help in this regard.
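As one concrete possibility, a one-parameter family of normalised weights can be built by interpolating geometrically between the two boundary curves of the identity-link region. This family is a hypothetical example constructed for illustration and is not proposed in the paper; the sketch checks on a grid that each member satisfies the identity-link condition of Corollary 26.

```python
def tuned_weight(t):
    """Hypothetical family w_t(c) = (1/(2c))^t * (1/(2(1-c)))^(1-t), t in [0, 1].

    Returns the weight and its logarithmic derivative w_t'(c) / w_t(c).
    Every member satisfies w_t(1/2) = 1.
    """
    def w(c):
        return (1.0 / (2.0 * c)) ** t * (1.0 / (2.0 * (1.0 - c))) ** (1.0 - t)

    def log_derivative(c):
        return -t / c + (1.0 - t) / (1.0 - c)

    return w, log_derivative

def convex_with_identity_link(log_derivative, grid, tol=1e-12):
    """Grid check of -1/c <= w'(c)/w(c) <= 1/(1-c) (Corollary 26)."""
    return all(
        -1.0 / c - tol <= log_derivative(c) <= 1.0 / (1.0 - c) + tol
        for c in grid
    )

grid = [i / 100.0 for i in range(1, 100)]
for t in (0.0, 0.25, 0.5, 1.0):
    w, log_derivative = tuned_weight(t)
    print(t, w(0.5), convex_with_identity_link(log_derivative, grid))
```

A learning algorithm exploring such a family would treat `t` as a tuning parameter; values of `t` outside [0, 1] leave the allowable region.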

Surrogate tuning differs from loss tailoring |18| [T9l ^ which involves adapting the loss 
to what you really think is important. In the surrogate tuning setting, we have fixed on 
0-1 loss as what we really want to minimise and use a surrogate solely for computational 
reasons. 



Finally, we conjecture that $\ell^{\mathrm{minimal}}$ (equations 51 and 52) is somehow special in the class of proper convex losses in some way other than being the pointwise minimiser of weights (and the normalised loss with smallest regret bound with respect to $\ell_{0\text{-}1}$) but the exact nature of the specialness still eludes us. Perhaps it is optimal in some weaker (minimax) sense. The reason for this suggestion is that it is not hard to show that for reasonable $P$ there exists $\mathcal{H}$ such that $c \mapsto \mathbb{L}_c(h, P)$ takes on all possible values within the constraints

\[ 0 \le \mathbb{L}_c(h, P) \le \max(c, 1 - c) \]

which follows immediately from the definition of cost-sensitive misclassification loss. Furthermore the example in the appendix below seems to require loss functions whose corresponding weight functions cross over each other and there is no weight function corresponding to a convex proper loss that crosses over $w_{\mathrm{minimal}}$.
A Example Showing Incommensurability of Two Proper Surrogate Losses

We consider X = [0, 1] with M being uniform on X, and consider the two problems that 
are induced by 

rii{x) = and 7/2(2;) = ^ + |. 

We use a simple linear hypothesis class

\[ \mathcal{H} := \{ h_a(x) := ax \colon a \in [0,1] \}, \]

with identity link function and consider the two surrogate proper losses $\ell_1$ and $\ell_2$ with weight functions

\[ w_1(c) = \frac{1}{c}, \qquad w_2(c) = \frac{1}{1-c}. \]

These weight functions correspond to the two curves that construct the left diagram in Figure 2. The corresponding conditional losses can be readily calculated to be

\begin{align*}
L_1(\eta,h) &:= \eta\,(h - 1 - \log(h)) + (1-\eta)\,h \\
L_2(\eta,h) &:= \eta\,(1 - h) + (1-\eta)\,(-h - \log(1-h)).
\end{align*}

One can numerically compute the parameters for the constrained Bayes optimal for each problem and for each surrogate loss:

\begin{align*}
a^*_{1,1} &= \arg\min_{a \in [0,1]} \mathbb{L}_1(\eta_1, h_a, M) = 0.66666667 \\
a^*_{2,1} &= \arg\min_{a \in [0,1]} \mathbb{L}_2(\eta_1, h_a, M) = 0.81779259 \\
a^*_{1,2} &= \arg\min_{a \in [0,1]} \mathbb{L}_1(\eta_2, h_a, M) = 1.00000000 \\
a^*_{2,2} &= \arg\min_{a \in [0,1]} \mathbb{L}_2(\eta_2, h_a, M) = 0.77763472.
\end{align*}

Furthermore

\begin{align*}
\mathbb{L}_{0\text{-}1}(\eta_1, h_{a^*_{1,1}}, M) &= 0.3580272, & \mathbb{L}_{0\text{-}1}(\eta_1, h_{a^*_{2,1}}, M) &= 0.3033476, \\
\mathbb{L}_{0\text{-}1}(\eta_2, h_{a^*_{1,2}}, M) &= 0.4166666, & \mathbb{L}_{0\text{-}1}(\eta_2, h_{a^*_{2,2}}, M) &= 0.4207872.
\end{align*}

Thus for problem $\eta_1$ the surrogate loss $L_2$ has a constrained Bayes optimal hypothesis $h_{a^*_{2,1}}$ which has a lower 0-1 risk than the constrained Bayes optimal hypothesis $h_{a^*_{1,1}}$ for the surrogate loss $L_1$. Thus for problem $\eta_1$ surrogate $L_2$ is better than surrogate $L_1$. However for problem $\eta_2$ the situation is reversed: surrogate $L_2$ is worse than surrogate $L_1$.

B An Alternate View of Canonical Links



This appendix contains an alternate approach to understanding canonical links using 
convex duality. In doing so we present an improved formulation of a result on the 
duality of Bregman divergences that may be of independent interest. 

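The Legendre-Fenchel (LF) dual $\phi^\star$ of a function $\phi \colon \mathbb{R} \to \mathbb{R}$ is a function defined by

\[ \phi^\star(s^\star) := \sup_{s} \{ \langle s, s^\star \rangle - \phi(s) \}. \tag{55} \]

The LF dual of any function is convex.

When $\phi(s)$ is a function of a real argument $s$ and the derivative $\phi'(s)$ exists, the Legendre-Fenchel conjugate $\phi^\star$ is given by the Legendre transform [40, 21]

\[ \phi^\star(s) = s \cdot (\phi')^{-1}(s) - \phi\left( (\phi')^{-1}(s) \right). \tag{56} \]

Thus (writing $\partial f := f'$) $f' = (\partial f^\star)^{-1}$. Thus with $w$, $W$, and $\overline{W}$ defined as above,

\[ W = (\partial(\overline{W}^\star))^{-1}, \quad W^{-1} = \partial(\overline{W}^\star), \quad \overline{W}^\star = \int W^{-1}. \tag{57} \]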

Let w, W, be as in Theorem [Tj Denote by Lw the w-weighted conditional loss 
parametrised hy W = f w and let AL\y be the corresponding regret (we can interchange 
AL and D here by (25) since ipL = id. 

\[ D_w(\eta,\hat\eta) = \overline{W}(\eta) - \overline{W}(\hat\eta) - (\eta - \hat\eta)\,W(\hat\eta). \tag{58} \]

We now further consider $D_w$ as given by (58). It will be convenient to parametrise $D$ by $W$ instead of $w$. Note that the standard parametrisation for a Bregman divergence is in terms of the convex function $\overline{W}$. Thus we will write $D_{\overline{W}}$, $D_W$ and $D_w$ to all represent (58). The following theorem is known (e.g. [19]) but as will be seen, stating it in terms of $D_W$ provides some advantages.

Theorem 34 Let w, W , W and Dw be as above. Then for all x,y €z [0, 1], 

Dw{x,y) = Dw-i{W{y),W{x)). (59) 



Proof Using (56) we have 

W*{u) = u ■ W-^{u) - W{W~^{u)) 

W{W~^{u)) = u-W-^{u) -W*{u). (60) 



Equivalently (using (57)) 

W\W{u)) =u-W{u) -W{u). (61) 



Thus substituting and then using ( 60 ) we have 

Dw{x,W'^{v)) = W{x)-W{W-'^{v))-{x-W-^{v))-W{W-^{v)) 
= W{x) + W\v) - vW'^{v) - (x - W-^{v)) • V 
= W{x)+W*{v)-x-v. (62) 



30 



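Similarly (this time using (61)) we have

$$\begin{aligned}
D_{W^{-1}}(v, W(x)) &= \overline{W}^\star(v) - \overline{W}^\star(W(x)) - (v - W(x)) \cdot W^{-1}(W(x)) \\
&= \overline{W}^\star(v) - x\, W(x) + \overline{W}(x) - v \cdot x + x\, W(x) \\
&= \overline{W}^\star(v) + \overline{W}(x) - v \cdot x.
\end{aligned} \tag{63}$$

Comparing (62) and (63) we see that

$$D_W(x, W^{-1}(v)) = D_{W^{-1}}(v, W(x)).$$

Let $y = W^{-1}(v)$. Thus substituting $v = W(y)$ leads to (59).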

The weight function corresponding to $D_{W^{-1}}$ is $\frac{\partial}{\partial x} W^{-1}(x) = \frac{1}{w(W^{-1}(x))}$.
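As a sanity check on (59), the identity can be evaluated numerically. The sketch below again assumes the log-loss weight $w(c) = 1/(c(1-c))$ as an illustrative example (so $D_W$ is the binary KL divergence); the function names are hypothetical and the dual divergence is written directly from its generator $\overline{W}^\star$, whose derivative is $W^{-1}$.

```python
import math

# Assumed example: log loss, w(c) = 1 / (c (1 - c)).

def W(c):
    """W = integral of w: the logit."""
    return math.log(c) - math.log1p(-c)

def W_inv(u):
    """Inverse of W: the logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-u))

def W_bar(c):
    """W_bar = integral of W: negative binary entropy."""
    return c * math.log(c) + (1.0 - c) * math.log1p(-c)

def W_bar_star(u):
    """LF dual of W_bar for this example: log(1 + e^u)."""
    return max(u, 0.0) + math.log1p(math.exp(-abs(u)))

def D_W(x, y):
    """Equation (58): Bregman divergence generated by W_bar."""
    return W_bar(x) - W_bar(y) - (x - y) * W(y)

def D_W_inv(u, v):
    """Bregman divergence generated by W_bar^*, with derivative W^{-1}."""
    return W_bar_star(u) - W_bar_star(v) - (u - v) * W_inv(v)

points = [0.05, 0.2, 0.5, 0.7, 0.95]

# Theorem 34: D_W(x, y) = D_{W^{-1}}(W(y), W(x)), note the swapped arguments.
max_gap = max(
    abs(D_W(x, y) - D_W_inv(W(y), W(x))) for x in points for y in points
)
min_divergence = min(D_W(x, y) for x in points for y in points)

print(max_gap, min_divergence)
```

The argument swap on the right-hand side of (59) matters: evaluating `D_W_inv(W(x), W(y))` instead gives a different value whenever `x != y`, because Bregman divergences are not symmetric in general.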

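Theorem 35 If the inverse link $\psi^{-1} = W^{-1}$ (and thus $\hat\eta = W^{-1}(h)$) then

$$\begin{aligned}
D_W(\eta, \hat\eta) = D_W(\eta, W^{-1}(h)) &= \overline{W}(\eta) + \overline{W}^\star(h) - \eta \cdot h \\
L_W(\eta, W^{-1}(h)) &= \overline{W}^\star(h) - \eta \cdot h + \eta\big(\overline{W}(1) + \overline{W}(0)\big) - \overline{W}(0) \\
\frac{\partial}{\partial h} L_W(\eta, W^{-1}(h)) &= \hat\eta - \eta
\end{aligned}$$

and furthermore $D_W(\eta, W^{-1}(h))$ and $L_W(\eta, W^{-1}(h))$ are convex in $h$.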



Proof The first two expressions follow immediately from (62) and (63) by substitution.
The derivative follows from calculation:
$\frac{\partial}{\partial h} L_W(\eta, W^{-1}(h)) = \frac{\partial}{\partial h}\big(\overline{W}^\star(h) - \eta \cdot h\big) = W^{-1}(h) - \eta = \hat\eta - \eta$.
The convexity follows from the fact that $\overline{W}^\star$ is convex (since it is
the LF dual of a convex function $\overline{W}$) and the overall expression is the sum of this and
a linear term, and thus convex. ■
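The statements of Theorem 35 can be illustrated on one familiar case. The sketch below assumes log loss with its canonical (logit) link, so that $W^{-1}$ is the sigmoid and $\overline{W}^\star(h) = \log(1 + e^h)$; with the negative-entropy choice of $\overline{W}$ (an assumption of this example) both $\overline{W}(0)$ and $\overline{W}(1)$ are zero, so the conditional risk reduces to $\overline{W}^\star(h) - \eta h$. The function names are hypothetical.

```python
import math

def sigmoid(h):
    """Inverse canonical link W^{-1} for log loss."""
    return 1.0 / (1.0 + math.exp(-h))

def softplus(h):
    """W_bar^*(h) = log(1 + e^h), computed stably."""
    return max(h, 0.0) + math.log1p(math.exp(-abs(h)))

def composite_risk(eta, h):
    """L_W(eta, W^{-1}(h)) computed from the definition of conditional risk."""
    p = sigmoid(h)
    return -eta * math.log(p) - (1.0 - eta) * math.log1p(-p)

def theorem_35_form(eta, h):
    """W_bar^*(h) - eta * h; the W_bar(0), W_bar(1) terms vanish here."""
    return softplus(h) - eta * h

eta = 0.3
hs = [-4.0 + 0.5 * k for k in range(17)]
step = 1e-4

# The two expressions for the conditional risk should agree.
form_gap = max(abs(composite_risk(eta, h) - theorem_35_form(eta, h)) for h in hs)

# The derivative in h should be eta_hat - eta with eta_hat = sigmoid(h).
slope_gap = max(
    abs(
        (composite_risk(eta, h + step) - composite_risk(eta, h - step)) / (2 * step)
        - (sigmoid(h) - eta)
    )
    for h in hs
)

# Convexity in h: second differences should be positive.
second_diffs = [
    composite_risk(eta, h + step) - 2 * composite_risk(eta, h) + composite_risk(eta, h - step)
    for h in hs
]

print(form_gap, slope_gap, min(second_diffs))
```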



The authors of [9] call $W$ the canonical link. We have already seen (Theorem 27) that the composite
loss constructed using the canonical link is convex.



C Convexity and Robustness 

In this appendix we show how the characterisation of the convexity of proper losses
(Theorem 29) allows one to make general algorithm-independent statements about the
robustness of convex proper losses to random misclassification noise.

Long and Servedio [31] have shown that boosting with convex potential functions
(i.e., convex margin losses) is not robust to random class noise.† That is, they are
susceptible to random class noise. In particular they present a very simple learning
task which is "boostable" - can be perfectly solved using a linear combination of base
classifiers - but for which, in the presence of any amount of label noise, idealised, early
stopping and $L_1$ regularised boosting algorithms will learn a classifier with only 50%
accuracy.

† We define exactly what we mean by robustness below. The notion that Long and Servedio [31]
examine is akin to that studied for instance by Kearns [25]. There are many other meanings of "robust"
which are different to that which we consider. The classical notion of robust statistics [22] is motivated
by robustness to contamination of additive observation noise (some heavy-tail noise mixed in with the
Gaussian noise often assumed in designing estimators). There are some results about particular machine
learning algorithms being robust in that sense [43]. "Robust" is also used to mean robustness with
respect to random attribute noise [48], robustness to unknown prior class probabilities [37], or a Huber-
style robustness to attribute noise ("outliers") for classification [T^. We only study robustness in the
sense of random label noise.

This has led to the recent proposal of boosting algorithms that use non-convex margin 
losses and experimental evidence suggests that these are more robust to class noise 
than their convex counterparts. Freund [H] recently described RobustBoost, which 
uses a parameterised family of non-convex surrogate losses that approximates the 0-1 
loss as the number of boosting iterations increases. Experiments on a variant of the 
task proposed by Long and Servedio [31] show that RobustBoost is very insensitive to
class noise. Masnadi-Shirazi and Vasconcelos [32] presented SavageBoost, a boosting 
algorithm built upon a non-convex margin function. They argued that even when the 
margin function is non-convex the conditional risk may still be convex. We elucidate 
this via our characterisation of the convexity of composite losses. Although all these 
results are suggestive, it is not clear from these results whether the robustness or not is 
a property of the loss function, the algorithm or a combination. We study that question 
by considering robustness in an algorithm-independent fashion. 

For $\alpha \in (0, \tfrac{1}{2})$ and $\eta \in [0, 1]$ we will define

$$\eta_\alpha := \alpha(1 - \eta) + (1 - \alpha)\eta$$

as the $\alpha$-corrupted version of $\eta$. This captures the idea that instead of drawing a positive
label for the point $x$ with probability $\eta(x)$ there is a random class flip with probability
$\alpha$. Since $\eta_\alpha$ is a convex combination of $\alpha$ and $1 - \alpha$ it follows that $\eta_\alpha \in [\alpha, 1 - \alpha]$. The
effect of $\alpha$-corruption on the conditional risk of a loss can be seen as a transformation
of the loss.

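Lemma 36 If $\ell^\psi$ is any composite loss then its conditional risk satisfies

$$L^\psi(\eta_\alpha, v) = L^\psi_\alpha(\eta, v), \qquad \eta \in [0, 1],\ v \in \mathcal{V},$$

where $\ell^\psi_\alpha(y, v) = (1 - \alpha)\,\ell^\psi(y, v) + \alpha\,\ell^\psi(-y, v)$.

Proof By simple algebraic manipulation we have

$$\begin{aligned}
L^\psi(\eta_\alpha, v) &= (1 - \eta_\alpha)\,\ell^\psi(-1, v) + \eta_\alpha\,\ell^\psi(1, v) \\
&= [(1 - \alpha)(1 - \eta) + \alpha\eta]\,\ell^\psi(-1, v) + [\alpha(1 - \eta) + (1 - \alpha)\eta]\,\ell^\psi(1, v) \\
&= (1 - \eta)[(1 - \alpha)\,\ell^\psi(-1, v) + \alpha\,\ell^\psi(1, v)] + \eta[\alpha\,\ell^\psi(-1, v) + (1 - \alpha)\,\ell^\psi(1, v)] \\
&= L^\psi_\alpha(\eta, v)
\end{aligned}$$

proving the result. ■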

In particular, if $\ell$ is strictly proper then $\ell_\alpha$ cannot be proper because the minimiser
of $L(\eta_\alpha, \cdot)$ is $\eta_\alpha$ and so $\eta_\alpha \neq \eta$ must also be the minimiser of $L_\alpha(\eta, \cdot)$. This suggests that
strictly proper losses are not robust to any class noise.
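Lemma 36 and the remark that follows it can be checked numerically. The sketch below assumes log loss with the identity link (so $v$ is itself a probability estimate) purely as an example; `corrupt`, `noisy_loss` and `risk` are hypothetical helper names, and the minimiser is located by a simple grid search rather than analytically.

```python
import math

def corrupt(eta, alpha):
    """The alpha-corrupted probability eta_alpha."""
    return alpha * (1.0 - eta) + (1.0 - alpha) * eta

def log_loss(y, p):
    """Log loss for a label y in {-1, 1} and an estimate p in (0, 1)."""
    return -math.log(p) if y == 1 else -math.log1p(-p)

def noisy_loss(y, p, alpha):
    """ell_alpha(y, v) = (1 - alpha) ell(y, v) + alpha ell(-y, v)."""
    return (1.0 - alpha) * log_loss(y, p) + alpha * log_loss(-y, p)

def risk(loss, eta, p):
    """Conditional risk (1 - eta) ell(-1, p) + eta ell(1, p)."""
    return (1.0 - eta) * loss(-1, p) + eta * loss(1, p)

alpha, eta = 0.2, 0.9
eta_a = corrupt(eta, alpha)
grid = [k / 1000 for k in range(1, 1000)]

def transformed(y, p):
    return noisy_loss(y, p, alpha)

# Lemma 36: risk of ell at eta_alpha equals risk of ell_alpha at eta.
lemma_gap = max(abs(risk(log_loss, eta_a, p) - risk(transformed, eta, p)) for p in grid)

# The transformed loss is minimised at eta_alpha, not at eta.
noisy_minimiser = min(grid, key=lambda p: risk(transformed, eta, p))

print(lemma_gap, eta_a, noisy_minimiser)
```

With these values the corrupted probability is approximately 0.74, so the minimiser of the noisy conditional risk sits well away from the clean value 0.9.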
C.1 Robustness implies Non-convexity



We now define a general notion of robustness for losses for class probability estimation. 

Definition 37 Given an $\alpha \in [0, \tfrac{1}{2})$, we will say a loss $\ell\colon \{-1, 1\} \times [0, 1] \to \mathbb{R}$ is $\alpha$-robust
at $\eta$ if the set of minimisers of the conditional risk for $\eta$ and the set of minimisers of
the conditional risk for $\eta_\alpha$ have some common points.

That is, a loss is $\alpha$-robust for a particular $\eta$ if minimising the noisy conditional risk can
potentially give an estimate that is also a minimiser of the non-noisy conditional risk.
Formally, $\ell$ is $\alpha$-robust at $\eta$ when

$$H(\ell, \eta_\alpha) \cap H(\ell, \eta) \neq \emptyset$$

where $H(\ell, \eta)$ is defined in (49).



Label noise is symmetric about $\tfrac{1}{2}$ and so the map $\eta \mapsto \eta_\alpha$ preserves the side of $\tfrac{1}{2}$
on which the values $\eta$ and $\eta_\alpha$ are found. That is, $\eta < \tfrac{1}{2}$ if and only if $\eta_\alpha < \tfrac{1}{2}$ for all
$\alpha \in [0, \tfrac{1}{2})$. This means that 0-1 misclassification loss or, equivalently, $\ell_{\frac{1}{2}}$ is $\alpha$-robust for
all $\eta$ and for all $\alpha$. For other $c$, the range of $\eta$ for which $\ell_c$ is $\alpha$-robust is more limited.

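Theorem 38 For each $c \in (0, 1)$, the loss $\ell_c$ is $\alpha$-robust at $\eta$ if and only if

$$\eta \notin \left[\frac{c - \alpha}{1 - 2\alpha},\, c\right) \text{ for } c < \tfrac{1}{2} \qquad \text{or} \qquad \eta \notin \left[c,\, \frac{c - \alpha}{1 - 2\alpha}\right) \text{ for } c \geq \tfrac{1}{2}.$$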



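Proof By the definition of $\ell_c$ and $[\![\hat\eta < c]\!] = 1 - [\![\hat\eta \geq c]\!]$ we have

$$L_c(\eta, \hat\eta) = (1 - \eta)\,c\,[\![\hat\eta \geq c]\!] + \eta(1 - c)[\![\hat\eta < c]\!] = \eta(1 - c) + (c - \eta)[\![\hat\eta \geq c]\!].$$

Since $c - \eta$ is positive iff $c > \eta$ we see $L_c(\eta, \hat\eta)$ is minimised for $\eta < c$ when $\hat\eta < c$ and
for $\eta \geq c$ when $\hat\eta \geq c$. So $H(\ell_c, \eta) = [0, c)$ for $\eta < c$ and $H(\ell_c, \eta) = [c, 1]$ for $\eta \geq c$. Since
$[0, c)$ and $[c, 1]$ are disjoint for all $c \in [0, 1]$ we see that $H(\ell_c, \eta)$ and $H(\ell_c, \eta_\alpha)$ coincide if
and only if $\eta, \eta_\alpha < c$ or $\eta, \eta_\alpha \geq c$ and are disjoint otherwise.

We proceed by cases. First, suppose $c < \tfrac{1}{2}$. For $\eta < c < \tfrac{1}{2}$ it is easy to show $\eta_\alpha \geq c$
iff $\eta \geq \frac{c - \alpha}{1 - 2\alpha}$ and so $\ell_c$ is not $\alpha$-robust for $\eta \in \left[\frac{c - \alpha}{1 - 2\alpha}, c\right)$. For $c \leq \eta$ we see $\ell_c$ must be
$\alpha$-robust since $\eta_\alpha < c$ iff $\eta < \frac{c - \alpha}{1 - 2\alpha}$ but $\frac{c - \alpha}{1 - 2\alpha} < c$ for $c < \tfrac{1}{2}$ which is a contradiction.
Thus, for $c < \tfrac{1}{2}$ we have $\ell_c$ is $\alpha$-robust iff $\eta \notin \left[\frac{c - \alpha}{1 - 2\alpha}, c\right)$.

For $c \geq \tfrac{1}{2}$ the main differences are that $\frac{c - \alpha}{1 - 2\alpha} \geq c$ for $c \geq \tfrac{1}{2}$ and $\eta_\alpha \leq \eta$ for $\eta \geq \tfrac{1}{2}$.
Thus, by a similar argument as above we see that $\ell_c$ is $\alpha$-robust iff $\eta \notin \left[c, \frac{c - \alpha}{1 - 2\alpha}\right)$. ■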
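The case analysis in this proof can be cross-checked by brute force. The sketch below compares the definition of $\alpha$-robustness for $\ell_c$ (the two minimiser sets lie on the same side of $c$) against the interval condition of Theorem 38 on a small grid. It uses exact rational arithmetic so that the half-open interval endpoints are handled without rounding; the function names and the grid are hypothetical choices made for illustration.

```python
from fractions import Fraction as F

def corrupt(eta, alpha):
    """The alpha-corrupted probability eta_alpha."""
    return alpha * (1 - eta) + (1 - alpha) * eta

def minimiser_side(c, eta):
    """H(ell_c, eta) is [0, c) when eta < c and [c, 1] otherwise."""
    return "below" if eta < c else "above"

def robust_by_definition(c, alpha, eta):
    """The minimiser sets for eta and eta_alpha intersect iff they are the same set."""
    return minimiser_side(c, eta) == minimiser_side(c, corrupt(eta, alpha))

def robust_by_theorem_38(c, alpha, eta):
    """Interval condition from Theorem 38."""
    t = (c - alpha) / (1 - 2 * alpha)
    if c < F(1, 2):
        return not (t <= eta < c)
    return not (c <= eta < t)

cs = [F(k, 20) for k in range(1, 20)]        # cut-offs c in (0, 1)
alphas = [F(1, 10), F(1, 4), F(2, 5)]        # noise rates in (0, 1/2)
etas = [F(k, 20) for k in range(21)]         # eta in [0, 1]

mismatches = [
    (c, a, e)
    for c in cs for a in alphas for e in etas
    if robust_by_definition(c, a, e) != robust_by_theorem_38(c, a, e)
]

print(len(mismatches))
# One concrete instance: c = 3/10, alpha = 1/10 gives the interval [1/4, 3/10).
print(robust_by_definition(F(3, 10), F(1, 10), F(1, 4)))
print(robust_by_definition(F(3, 10), F(1, 10), F(1, 5)))
```

For $c = 3/10$ and $\alpha = 1/10$, the point $\eta = 1/4$ is mapped exactly onto $c$ by the corruption and so falls in the non-robust interval, while $\eta = 1/5$ is mapped to $13/50 < c$ and stays robust.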

This theorem allows us to characterise the robustness of arbitrary proper losses by
appealing to the integral representation in (11).
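Lemma 39 If $\ell$ is a proper loss with weight function $w$ then $H(\ell, \eta) = \bigcap_{c\colon w(c) > 0} H(\ell_c, \eta)$
and so

$$H(\ell, \eta) \cap H(\ell, \eta_\alpha) = \bigcap_{c\colon w(c) > 0} \big( H(\ell_c, \eta) \cap H(\ell_c, \eta_\alpha) \big).$$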

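Proof We first show that $H(\ell, \eta) \subseteq \bigcap_{c\colon w(c) > 0} H(\ell_c, \eta)$ by contradiction. Assume there
is an $\hat\eta \in H(\ell, \eta)$ but for which there is some $c_0$ such that $w(c_0) > 0$ and $\hat\eta \notin H(\ell_{c_0}, \eta)$.
Then there is an $\hat\eta' \in H(\ell_{c_0}, \eta)$ and $\hat\eta' \in H(\ell_c, \eta)$ for all other $c$ for which $w(c) > 0$
(otherwise $H(\ell, \eta) = \{\hat\eta\}$). Thus, $L_{c_0}(\eta, \hat\eta') < L_{c_0}(\eta, \hat\eta)$ and so
$\int_0^1 L_c(\eta, \hat\eta')\, w(c)\, dc < \int_0^1 L_c(\eta, \hat\eta)\, w(c)\, dc$ since $w(c_0) > 0$, contradicting the
assumption that $\hat\eta$ minimises $L(\eta, \cdot)$.

Now suppose $\hat\eta \in \bigcap_{c\colon w(c) > 0} H(\ell_c, \eta)$. That is, $\hat\eta$ is a minimiser of $L_c(\eta, \cdot)$ for all $c$
such that $w(c) > 0$ and therefore must also be a minimiser of $L(\eta, \cdot) = \int_0^1 L_c(\eta, \cdot)\, w(c)\, dc$
and is therefore in $H(\ell, \eta)$, proving the converse. ■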

One consequence of this lemma is that if $w(c) > 0$ and $\ell_c$ is not $\alpha$-robust at $\eta$ then,
by definition, $H(\ell_c, \eta) \cap H(\ell_c, \eta_\alpha) = \emptyset$ and so $\ell$ cannot be $\alpha$-robust at $\eta$. This means
we have established the following theorem regarding the $\alpha$-robustness of an arbitrary
proper loss in terms of its weight function.

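Theorem 40 If $\ell$ is a proper loss with weight function $w$ then it is not $\alpha$-robust for any

$$\eta \in \bigcup_{c\colon w(c) > 0} \left[\frac{c - \alpha}{1 - 2\alpha},\, c\right) \cup \left[c,\, \frac{c - \alpha}{1 - 2\alpha}\right).$$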

By Corollary 30 we see that convex proper losses are strictly proper and thus have
weight functions which are non-zero for all $c \in [0, 1]$ and so by Theorem 40 we have the
following corollary.

Corollary 41 If a proper loss is convex, then for all $\alpha \in (0, \tfrac{1}{2})$ it is not $\alpha$-robust at any
$\eta \in [0, 1]$.
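To see how Theorem 40 produces non-robustness when the weight function is positive everywhere, it is enough to exhibit, for a given $\eta$, one cut-off $c$ whose Theorem 38 interval contains $\eta$. The sketch below uses the midpoint between $\eta$ and $\eta_\alpha$ as that witness; this particular choice is an assumption of the example, not a construction taken from the text. The sketch is restricted to $\eta \neq \tfrac{1}{2}$, since $\eta_\alpha = \eta$ at $\eta = \tfrac{1}{2}$ and no such cut-off exists there. Exact rationals avoid rounding at the interval endpoints.

```python
from fractions import Fraction as F

HALF = F(1, 2)

def corrupt(eta, alpha):
    """The alpha-corrupted probability eta_alpha."""
    return alpha * (1 - eta) + (1 - alpha) * eta

def witness_cutoff(eta, alpha):
    """A cut-off c strictly between eta and eta_alpha (hypothetical choice).

    Returns None at eta = 1/2, where corruption leaves eta unchanged.
    """
    if eta == HALF:
        return None
    return (eta + corrupt(eta, alpha)) / 2

def in_non_robust_interval(c, alpha, eta):
    """Membership of eta in the Theorem 38 interval for ell_c."""
    t = (c - alpha) / (1 - 2 * alpha)
    return (t <= eta < c) if c < HALF else (c <= eta < t)

alpha = F(1, 10)
etas = [F(k, 20) for k in range(21) if k != 10]   # grid on [0, 1] without 1/2

witnesses = {e: witness_cutoff(e, alpha) for e in etas}
all_in_unit_interval = all(0 < c < 1 for c in witnesses.values())
all_witnessed = all(in_non_robust_interval(c, alpha, e) for e, c in witnesses.items())

print(all_in_unit_interval, all_witnessed)
```

Because each witness lies in $(0, 1)$, a weight function that is positive on all of $(0, 1)$ places every such $\eta$ inside the union in Theorem 40.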



At a high level, this result - "convexity implies non-robustness" - appears to be logically
equivalent to Long and Servedio's result that "robustness implies non-convexity".
However, there are a few discrepancies that mean they are not directly comparable. The
definitions of robustness differ. We focus on the point-wise minimisation of conditional
risk as this is, ideally, what most risk minimisation approaches try to achieve. However,
this means that robustness of ERM with regularisation or restricted function classes is
not directly captured with our definition whereas Long and Servedio analyse this latter
case directly. In our definition the focus is on probability estimation robustness while
the earlier work is focussed on classification accuracy. Our work could be extended to
this case by analysing $H(\ell, \eta) \cap H(\ell_{\frac{1}{2}}, \eta)$.

Additionally, their work restricts attention to the robustness of boosting algorithms 
that use convex potential functions whereas our analysis is not tied to any specific 
algorithm. By restricting their attention to a specific learning task and class of functions 
they are able to show a very strong result: that convex losses for boosting lead to
arbitrarily bad performance with arbitrarily little noise. Also, our focus on proper
losses excludes some convex losses (such as the hinge loss) that are covered by Long and
Servedio's results.

Finally, it is worth noting that there are non-convex loss functions that are strictly
proper and so are not robust in the sense we use here. That is, the converse of Corollary 41
is not true. For example, any loss with weight function that sits above but
outside the shaded region in Figure 2 will be non-convex and non-robust. This suggests
that the arguments made by Masnadi-Shirazi and Vasconcelos [32] and by Freund for the
robustness of non-convex losses need further investigation.